    Delete line with duplicate Number

    • Jim Erlich

      **Sorry for the repost, going to try and simplify my question.**

      I’ve spent the first half of my day trying to figure out how to do this, and despite a lot of googling I was unable to find an exact answer.

      …Random words, letters, and numbers are on each line…

      Objective: find lines that have exact duplicate numbers (not letters or words).

      Before example:

      A dog went to the mall - #11364
      The dog went to the store - #11364
      A dog is at the mall - #14369
      Dog to the store random - #14369
      Sentence a random - #13677
      The went dog to store - #11159

      After example:

      A random sentence - #11364
      A sentence random - #14369
      Sentence a random - #13677
      The went dog to store - #11159

      • The formula needs to at least: match lines that have identical numbers.
      • The formula does NOT need to: delete one of the lines

      I’m fine with manually deleting the lines that have an identical number match.

      Any help is appreciated, thank you

      • guy038

        Hello, @jim-erlich,

        Before finding a way to solve your problem, we need some additional information:

        • In your previous post, which has since been deleted, you spoke about a giant file: what is its approximate size, and how many lines does it contain?

        • Do you mind if your file is modified by a preliminary sort? If you don’t mind, this could significantly simplify the whole process!

        • How many digits can exist after the # symbol? Is 5 the maximum, or 8, or 10 digits, or …?

        • What is the maximum number of lines between two “duplicate” lines (for instance, xxxxxxxx#12345 and yyyyyyyy#12345)?

        • Are there ever more than 2 “duplicate” lines (I mean, for instance, 6 lines ending with #12345)?

        • In case of multiple “duplicate” lines, which one do you want to keep: the first duplicate line or the last one?

        Note that we can cope with a possible space character between the # symbol and the number. Not a problem!

        See you later !

        Best regards,

        guy038

        • Jim Erlich @guy038

          • 12,000 lines broken up into different pages. Each page can have 300 to 1,100 lines.
          • Right now the lines are in Ascending order based upon the number.

          For example:
          The dog went to the park - #4599
          The cat went to the park - #4657
          The kid went to the park - #4797
          The lizard went to the zoo - #5100
          The cat went to the zoo - #5120
          etc…

          Ideally, I would like to keep the numbers in Ascending order like this, and locate the lines that have duplicate numbers

          • The number of digits after the # symbol is 1 to 5… the highest number being about 14000
          • Oh this is good… the duplicate line that has the duplicate number SHOULD BE on the next line. So it will be like this

          For example:
          The dog went to the park - #12554
          The cat went to the park - #12554

          ^^^ The formula should say “Hey!! These two numbers are identical” and then I will delete one of them manually.

          • There will be only 2 ‘duplicate’ lines. There will not be 6 lines ending with #12345
          • Keep the last duplicate line, delete the first
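For readers who only want the duplicates flagged, not deleted, here is a minimal Python sketch of that requirement. The helper name `find_adjacent_duplicates` and the third sample line are made up for illustration, and the sketch assumes the number always sits at the end of the line after a `#`, with at most one space in between.

```python
import re

# Assumption: the number is the last thing on the line, after '#' and an optional space.
NUMBER_AT_END = re.compile(r"#\x20?(\d+)\s*$")

def find_adjacent_duplicates(lines):
    """Return (first_line_no, second_line_no, number) for each adjacent pair
    of lines that end with the same number. Line numbers are 1-based."""
    hits = []
    previous = None
    for line_no, line in enumerate(lines, start=1):
        match = NUMBER_AT_END.search(line)
        current = match.group(1) if match else None
        if current is not None and current == previous:
            hits.append((line_no - 1, line_no, current))
        previous = current
    return hits

sample = [
    "The dog went to the park - #12554",
    "The cat went to the park - #12554",
    "The kid went to the park - #12600",  # hypothetical extra line
]
duplicates = find_adjacent_duplicates(sample)
for first, second, number in duplicates:
    print(f"Lines {first} and {second} share #{number}")
```

Because the file is already sorted by number, comparing each line only with the one before it is enough; an unsorted file would need a dictionary of numbers seen so far.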

          I hope this is clear, let me know if there is anything else.

          Thank you

          • guy038

            Hi, @jim-erlich and All,

            Many thanks for all your information! It should be very easy to get the right solution!

            The most important points are:

            • Your file is already sorted by ascending #number

            • And, when a line has a duplicate, it is located right after the original line!


            Now, I don’t know how your preliminary sort handled numbers that are separated from the # symbol by a space char.

            For instance :

            Are lines sorted, as below ( case A ) :

            The dog went to the park - #4599
            The cat went to the park - # 4657
            The kid went to the park - #4797
            The lizard went to the zoo - # 5100
            The cat went to the zoo - #5120
            

            OR

            Are lines sorted, like ( case B ) :

            The cat went to the park - # 4657
            The lizard went to the zoo - # 5100
            The dog went to the park - #4599
            The kid went to the park - #4797
            The cat went to the zoo - #5120
            

            Indeed, the N++ sort would place a line ending with # 10000 before a line ending with #5000, as the code point of a space is smaller than the code point of any digit!
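The code-point argument can be checked with a small Python sketch. It compares only the `#number` tails, which is a simplification: a real line sort starts comparing at the first character of each line.

```python
# Compare the two tails the way a plain character-by-character sort would.
with_space = "# 10000"
without_space = "#5000"

# Both strings start with '#', so the second character decides the order.
print(ord(" "), ord("5"))  # code points of the space and of the digit 5

ordered = sorted([without_space, with_space])
space_sorts_first = with_space < without_space
print(ordered)
```

The space wins the comparison at the second character, so the numeric value of what follows never gets looked at.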


            Anyway, assuming a sort, like in case A and the initial text :

            The dog went to the park - #4599
            You went to the zoo - # 4640
            He went to the park - # 4640
            The cat went to the park - #4657
            The kid went to the park - # 4657
            The girl went to the park - #4900
            The lizard went to the zoo - # 5100
            The cat went to the zoo - #5100
            I went to the park - #7500
            We went to the zoo - #7500
            They went to the park - #14000
            

            Here is the road map :

            • Open your file in N++

            • Open the Replace dialog ( Ctrl + H )

            • SEARCH (?-s)^.+#\x20?(\d+)\R(?=.+#\x20?\1)

            • REPLACE Leave EMPTY

            • Tick the Wrap around option

            • Select the Regular expression search mode

            • Click once on the Replace All button or several times on the Replace button

            Voilà!

            You should get your expected list :

            The dog went to the park - #4599
            He went to the park - # 4640
            The kid went to the park - # 4657
            The girl went to the park - #4900
            The cat went to the zoo - #5100
            We went to the zoo - #7500
            They went to the park - #14000
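
The same search and replace can be approximated outside N++. This is only a sketch with Python's `re` module, not the Boost engine that N++ uses: Python has no `\R` and no unscoped `(?-s)`, so `\r?\n` stands in for `\R`, and the default behaviour of `.` (it never matches a line break) stands in for `(?-s)`.

```python
import re

# Approximation of the N++ search regex for Python's re module.
# re.MULTILINE makes ^ match at the start of every line.
PATTERN = re.compile(r"^.+#\x20?(\d+)\r?\n(?=.+#\x20?\1)", re.MULTILINE)

text = """\
The dog went to the park - #4599
You went to the zoo - # 4640
He went to the park - # 4640
The cat went to the park - #4657
The kid went to the park - # 4657
The girl went to the park - #4900
The lizard went to the zoo - # 5100
The cat went to the zoo - #5100
I went to the park - #7500
We went to the zoo - #7500
They went to the park - #14000
"""

# An empty replacement deletes each matched line together with its line break.
result = PATTERN.sub("", text)
print(result, end="")
```

As in the N++ version, the first line of each duplicate pair is removed and the last one is kept.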
            

            Note that if your sort is rather like in case B, it shouldn’t be difficult to get back to the case A order and run the regex S/R afterwards ;-))
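One possible way to get from the case B order back to the case A order is to sort on the numeric value after the `#`. The sketch below does this in Python, not with an N++ feature, and the key function `number_after_hash` is a made-up helper.

```python
import re

def number_after_hash(line):
    """Hypothetical sort key: the integer after '#', ignoring one optional space.
    Lines without a trailing number are pushed to the end."""
    match = re.search(r"#\x20?(\d+)\s*$", line)
    return int(match.group(1)) if match else float("inf")

case_b = [
    "The cat went to the park - # 4657",
    "The lizard went to the zoo - # 5100",
    "The dog went to the park - #4599",
    "The kid went to the park - #4797",
    "The cat went to the zoo - #5120",
]

# sorted() is stable, so lines with equal numbers keep their relative order.
case_a = sorted(case_b, key=number_after_hash)
print("\n".join(case_a))
```

Since the key is an integer, the optional space no longer influences the order, and duplicate numbers end up on adjacent lines.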

            Next time, if everything was OK, I’ll explain how this regex S/R works !

            Best Regards,

            guy038

            • Jim Erlich @guy038

              Flawless… absolutely amazing. You don’t understand how thankful I am for your detailed and correct response. This is a whole new language for me and I am amazed. thank you again.

              • guy038

                Hello @jim-erlich and All,

                Sorry for being late! So, here are some explanations of my regex S/R:

                SEARCH (?-s)^.+#\x20?(\d+)\R(?=.+#\x20?\1)

                REPLACE Leave EMPTY

                • First, the (?-s) in-line modifier ensures that any later . regex symbol matches a single standard character only, and never a line-break char!

                • So, the next part ^.+#\x20? searches, from the beginning of the line ( ^ ), for any non-empty range of characters ( .+ ), followed by the # symbol and an optional space char ( \x20? )

                • Then, it looks for a non-empty range of digits ( \d+ ), followed by line-break character(s) ( \R )

                • So, the regex engine looks for an entire line (the digits after the # are stored as group 1, since they are enclosed in parentheses), but ONLY IF the next line ends with the same number!

                • This condition is expressed with a look-ahead structure (?=......), which is a user-defined assertion, in the same way that, for instance, the $ symbol is a built-in assertion for the zero-length location “end of line”!

                • So the current line must be followed by text matching the regex .+#\x20?\1, which represents, again, a non-empty range of standard characters followed by a # and possibly a space char and, finally, group 1 ( \1 ), which is the ending number of the current line

                • Note that a ^ assertion for the second line, in the look-ahead structure, would be useless, as the range ( .+ ) comes right after the line-break char(s) \R, anyway!

                • As the replacement zone is empty, the current line, with its line-break, is simply deleted
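
The pieces described above can be laid out one per line in a verbose regex. This is a Python sketch, so `\r?\n` again stands in for `\R`. The edge case at the end uses made-up data that does not appear in the thread, and the stricter pattern is only a possible variant, not part of the original solution.

```python
import re

# Verbose-mode approximation of the search regex ('#' must be escaped here).
SEARCH = re.compile(
    r"""
    ^ .+ \# \x20?          # start of line, any text, '#', optional space
    (\d+)                  # group 1: the number ending the current line
    \r?\n                  # line break, consumed so the whole line is removed
    (?= .+ \# \x20? \1 )   # look-ahead: next line has '#', optional space, same digits
    """,
    re.MULTILINE | re.VERBOSE,
)

pair = "The dog went to the park - #12554\nThe cat went to the park - #12554\n"
match = SEARCH.search(pair)
print(match.group(1))        # the captured number
print(repr(match.group(0)))  # the first line plus its line break

# Edge case (assumed data): nothing follows \1 in the look-ahead, so "#1255"
# followed by "#12554" also matches, although the numbers differ.
prefix_case = "a - #1255\nb - #12554\n"
loose_hit = SEARCH.search(prefix_case) is not None

# Hypothetical stricter variant: no further digit may follow the back-reference.
STRICT = re.compile(r"^.+#\x20?(\d+)\r?\n(?=.+#\x20?\1(?!\d))", re.MULTILINE)
strict_hit = STRICT.search(prefix_case) is not None
print(loose_hit, strict_hit)
```

With numbers sorted in ascending order, a pair such as #1255 and #12554 on consecutive lines is unlikely, so the original regex is probably fine for the file described in this topic.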


                For a quick overview of regular expressions, see the N++ documentation, below:

                https://npp-user-manual.org/docs/searching/#regular-expressions

                See also the main links regarding the Boost regex library, used by the N++ regex engine:

                https://www.boost.org/doc/libs/1_70_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html

                https://www.boost.org/doc/libs/1_70_0/libs/regex/doc/html/boost_regex/format/boost_format_syntax.html

                Finally, see this FAQ topic about regular expressions :

                https://community.notepad-plus-plus.org/topic/15765/faq-desk-where-to-find-regex-documentation

                Best Regards,

                guy038

                The Community of users of the Notepad++ text editor.