    Delete line with duplicate Number

    • Jim Erlich

      **Sorry for the repost; I’m going to try to simplify my question.**

      I’ve spent the first half of my day trying to figure out how to do this and googling for an answer, without success.

      …Random words, letters, and numbers are on each line…

      Objective: find lines that have exact duplicate numbers (not letters or words).

      Before example:

      A dog went to the mall - #11364
      The dog went to the store - #11364
      A dog is at the mall - #14369
      Dog to the store random - #14369
      Sentence a random - #13677
      The went dog to store - #11159

      After example:

      The dog went to the store - #11364
      Dog to the store random - #14369
      Sentence a random - #13677
      The went dog to store - #11159

      • The formula needs to at least: match lines that have identical numbers.
      • The formula does NOT need to: delete one of the lines

      I’m fine with manually deleting the lines that have an identical number match.

      Any help is appreciated, thank you

      • guy038

        Hello, @jim-erlich,

        Before we can find a way to solve your problem, we need some additional information:

        • In your previous post, which has since been deleted, you spoke about a giant file: what is its approximate size, and how many lines does it contain?

        • Do you mind if your file is modified by a preliminary sort? If you don’t, this could significantly simplify the whole process!

        • How many digits can follow the # symbol? Is the maximum 5, 8, 10 digits, or more?

        • What is the maximum number of lines between two “duplicate” lines (for instance, xxxxxxxx#12345 and yyyyyyyy#12345)?

        • Are there ever more than 2 “duplicate” lines (I mean, for instance, 6 lines ending with #12345)?

        • In case of multiple “duplicate” lines, which one do you want to keep: the first duplicate line or the last?

        Note that we can cope with a possible space character between the # symbol and the number. Not a problem!

        See you later !

        Best regards,

        guy038

        • Jim Erlich @guy038

          • 12,000 lines broken up into different pages. Each page can have 300 to 1,100 lines.
          • Right now the lines are in Ascending order based upon the number.

          For example:
          The dog went to the park - #4599
          The cat went to the park - #4657
          The kid went to the park - #4797
          The lizard went to the zoo - #5100
          The cat went to the zoo - #5120
          etc…

          Ideally, I would like to keep the numbers in Ascending order like this, and locate the lines that have duplicate numbers

          • The number of digits after the # symbol is 1 to 5… the highest number being about 14000
          • Oh this is good… the duplicate line that has the duplicate number SHOULD BE on the next line. So it will be like this

          For example:
          The dog went to the park - #12554
          The cat went to the park - #12554

          ^^^ The formula should say “Hey!! These two numbers are identical” and then I will delete one of them manually.

          • There will be only 2 ‘duplicate’ lines. There will not be 6 lines ending with #12345
          • Keep the last duplicate line, delete the first

          I hope this is clear, let me know if there is anything else.

          Thank you
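The requirement described in this post (flag neighbouring lines that carry the same number, without deleting anything) can be sketched in a few lines of Python. This is only an illustration outside Notepad++: the function name `adjacent_duplicates` and the pattern `NUMBER_AT_END` are hypothetical, and the sketch assumes each line ends with `#` followed by an optional space and the digits.

```python
import re

# Assumption: the number is the last thing on the line, written "#123" or "# 123".
NUMBER_AT_END = re.compile(r"#\x20?(\d+)\s*$")


def adjacent_duplicates(lines):
    """Return (first_line_no, second_line_no, number) for neighbours sharing a number."""
    found = []
    prev_no, prev_num = None, None
    for no, line in enumerate(lines, start=1):
        m = NUMBER_AT_END.search(line)
        num = int(m.group(1)) if m else None
        if num is not None and num == prev_num:
            found.append((prev_no, no, num))
        prev_no, prev_num = no, num
    return found


page = [
    "The dog went to the park - #12554",
    "The cat went to the park - #12554",
    "The kid went to the park - #12600",
]
report = adjacent_duplicates(page)
print(report)
```

Each reported tuple names the two line numbers to inspect, so the choice of which line to delete stays manual.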

          • guy038

            Hi, @jim-erlich and All,

            Many thanks for all your information! It should be very easy to get the right solution!

            The most important points are:

            • Your file is already sorted by ascending #number

            • And, when a line has a duplicate, the duplicate is located right after the original line!


            Now, how did your preliminary sort handle numbers that are separated from the # symbol by a space char?

            For instance :

            Are lines sorted, as below ( case A ) :

            The dog went to the park - #4599
            The cat went to the park - # 4657
            The kid went to the park - #4797
            The lizard went to the zoo - # 5100
            The cat went to the zoo - #5120
            

            OR

            Are lines sorted, like ( case B ) :

            The cat went to the park - # 4657
            The lizard went to the zoo - # 5100
            The dog went to the park - #4599
            The kid went to the park - #4797
            The cat went to the zoo - #5120
            

            Indeed, the N++ sort would place the line # 10000 before the line #5000, as the code-point of the space char is smaller than the code-point of any digit!
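The code-point argument can be checked with a small Python sketch. It compares only the `#…` tails as plain strings, which is an assumption made for illustration; a real line sort compares whole lines from their first character.

```python
# Code points: space is 0x20, the digits '0'..'9' are 0x30..0x39.
space_cp = ord(" ")
five_cp = ord("5")

tails = ["#5000", "# 10000", "#10000"]

# Plain string comparison: after '#', the space sorts before any digit,
# and '1' sorts before '5', so both 10000 entries land before 5000.
text_order = sorted(tails)

# Numeric comparison: strip '#' and spaces, then compare as integers.
numeric_order = sorted(tails, key=lambda t: int(t.lstrip("# ")))

print(text_order)
print(numeric_order)
```

The second ordering is the "case A" behaviour, where the optional space has no influence on the position of a line.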


            Anyway, assuming a sort like in case A, and the following initial text:

            The dog went to the park - #4599
            You went to the zoo - # 4640
            He went to the park - # 4640
            The cat went to the park - #4657
            The kid went to the park - # 4657
            The girl went to the park - #4900
            The lizard went to the zoo - # 5100
            The cat went to the zoo - #5100
            I went to the park - #7500
            We went to the zoo - #7500
            They went to the park - #14000
            

            Here is the road map :

            • Open your file in N++

            • Open the Replace dialog ( Ctrl + H )

            • SEARCH (?-s)^.+#\x20?(\d+)\R(?=.+#\x20?\1)

            • REPLACE Leave EMPTY

            • Tick the Wrap around option

            • Select the Regular expression search mode

            • Click once on the Replace All button or several times on the Replace button

            Voila !

            You should get your expected list :

            The dog went to the park - #4599
            He went to the park - # 4640
            The kid went to the park - # 4657
            The girl went to the park - #4900
            The cat went to the zoo - #5100
            We went to the zoo - #7500
            They went to the park - #14000
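For readers who want to try the same search/replace outside Notepad++, here is a minimal Python sketch. It is an approximation, not the Boost engine: Python's `re` has neither `\R` nor a global `(?-s)`, so `\R` is written `\r?\n`, and `re.MULTILINE` lets `^` match at each line start (the dot already excludes `\n` by default). The function name `drop_first_duplicates` is hypothetical.

```python
import re

# Python rendering of: (?-s)^.+#\x20?(\d+)\R(?=.+#\x20?\1)
DUPLICATE_LINE = re.compile(r"^.+#\x20?(\d+)\r?\n(?=.+#\x20?\1)", re.MULTILINE)


def drop_first_duplicates(text: str) -> str:
    """Delete each line whose #number is repeated on the line right after it."""
    return DUPLICATE_LINE.sub("", text)


sample = "\n".join([
    "The dog went to the park - #4599",
    "You went to the zoo - # 4640",
    "He went to the park - # 4640",
    "The cat went to the park - #4657",
    "The kid went to the park - # 4657",
    "The girl went to the park - #4900",
    "The lizard went to the zoo - # 5100",
    "The cat went to the zoo - #5100",
    "I went to the park - #7500",
    "We went to the zoo - #7500",
    "They went to the park - #14000",
]) + "\n"

result = drop_first_duplicates(sample)
print(result, end="")
```

As in the Replace All run, the first line of each pair is removed and the last one is kept.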
            

            Note that if your sort is rather like case B, it shouldn’t be difficult to get back to the case A order and run the regex S/R afterwards ;-))
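One possible way to get from the case B order back to the case A order is to sort on the number itself rather than on the line text. The Python sketch below is an illustration only; `number_key` and `NUMBER_AT_END` are hypothetical names, and it assumes every line ends with `#`, an optional space, and the digits.

```python
import re

# Assumption: every line ends with "#<digits>" or "# <digits>".
NUMBER_AT_END = re.compile(r"#\x20?(\d+)\s*$")


def number_key(line: str) -> int:
    """Sort key: the trailing number as an integer, ignoring the optional space."""
    m = NUMBER_AT_END.search(line)
    if m is None:
        raise ValueError(f"no #number at end of line: {line!r}")
    return int(m.group(1))


case_b = [
    "The cat went to the park - # 4657",
    "The lizard went to the zoo - # 5100",
    "The dog went to the park - #4599",
    "The kid went to the park - #4797",
    "The cat went to the zoo - #5120",
]

# sorted() is stable, so lines sharing a number keep their relative order.
case_a = sorted(case_b, key=number_key)
print("\n".join(case_a))
```

Because the sort is stable, duplicate lines stay next to each other in their original order, which is what the regex S/R relies on.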

            Next time, if everything was OK, I’ll explain how this regex S/R works !

            Best Regards,

            guy038

            • Jim Erlich @guy038

              Flawless… absolutely amazing. You don’t understand how thankful I am for your detailed and correct response. This is a whole new language for me and I am amazed. Thank you again.

              • guy038

                Hello @jim-erlich and All,

                Sorry for being late! Here are some explanations of my regex S/R:

                SEARCH (?-s)^.+#\x20?(\d+)\R(?=.+#\x20?\1)

                REPLACE Leave EMPTY

                • First, the (?-s) in-line modifier ensures that any subsequent . regex symbol matches a single standard character only, never a line-break char!

                • So, the next part ^.+#\x20? searches, from the beginning of a line ( ^ ), for any non-empty range of characters ( .+ ), followed by the # symbol and an optional space char ( \x20? )

                • Then, it looks for a non-empty range of digits ( \d+ ), followed by line-break character(s) ( \R )

                • So, the regex engine matches an entire line (the digits after the # are stored as group 1, as they are enclosed in parentheses), but ONLY IF the next line has the same number after its # symbol!

                • This condition is expressed with a look-ahead structure (?=......), which is a user-defined assertion, in the same way that, for instance, the $ symbol is a built-in zero-length assertion meaning “end of line”!

                • So the current line must be followed by text matching the regex .+#\x20?\1, which represents, again, a non-empty range of standard characters, followed by a # and possibly a space char, and finally group 1 ( \1 ), which is the ending number of the current line

                • Note that a ^ assertion for the second line, inside the look-ahead structure, would be useless, as the range ( .+ ) comes right after the line-break char(s) \R anyway!

                • As the replacement zone is empty, the current line, with its line-break, is simply deleted
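The bullet points above can be lined up against the pattern itself. The sketch below is a Python rendering, not the Boost syntax used by N++: `\R` becomes `\r?\n`, `re.MULTILINE` stands in for the line-based `^`, and in `re.VERBOSE` mode the literal `#` has to be written `\#`. The names `ANNOTATED` and `STRICT` are hypothetical. The `STRICT` variant is an editorial assumption, not part of the original answer: it adds `(?!\d)` because the look-ahead, as written, does not check that the next number stops after the group 1 digits.

```python
import re

ANNOTATED = re.compile(
    r"""
    ^              # start of a line (re.MULTILINE)
    .+             # one or more characters; the dot never crosses a line break
    \#\x20?        # the hash symbol, then an optional space
    (\d+)          # group 1: the digits that end the current line
    \r?\n          # the line break (Python has no \R)
    (?=            # look-ahead: inspect the next line without consuming it
      .+\#\x20?\1  # any text, hash, optional space, then the group 1 digits
    )
    """,
    re.MULTILINE | re.VERBOSE,
)

# Hypothetical stricter variant: the next number must not continue with a digit.
STRICT = re.compile(r"^.+#\x20?(\d+)\r?\n(?=.+#\x20?\1(?!\d))", re.MULTILINE)

pair = "The dog went to the park - #12554\nThe cat went to the park - #12554\n"
m = ANNOTATED.search(pair)
print(repr(m.group(0)), m.group(1))

# #123 followed by #1234: the unanchored look-ahead accepts the prefix.
near = "a - #123\nb - #1234\n"
print(ANNOTATED.search(near) is not None, STRICT.search(near) is None)
```

Only the first line and its line break are part of the match; the second line is merely inspected, which is why it survives the replacement. With ascending numbers of 1 to 5 digits the prefix case should be rare, so the extra `(?!\d)` is a precaution rather than a correction.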


                For a quick overview of regular expressions, see the N++ documentation, below:

                https://npp-user-manual.org/docs/searching/#regular-expressions

                See also the main links regarding the Boost regex library, used by the N++ regex engine:

                https://www.boost.org/doc/libs/1_70_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html

                https://www.boost.org/doc/libs/1_70_0/libs/regex/doc/html/boost_regex/format/boost_format_syntax.html

                Finally, see this FAQ topic about regular expressions :

                https://community.notepad-plus-plus.org/topic/15765/faq-desk-where-to-find-regex-documentation

                Best Regards,

                guy038

                The Community of users of the Notepad++ text editor.