Community
    • Login

    Remove data lines with same value in certain column

    Scheduled Pinned Locked Moved Help wanted · · · – – – · · ·
    11 Posts 4 Posters 953 Views
    Loading More Posts
    • Oldest to Newest
    • Newest to Oldest
    • Most Votes
    Reply
    • Reply as topic
    Log in to reply
    This topic has been deleted. Only users with topic management privileges can see it.
    • Jizhao LiJ
      Jizhao Li
      last edited by

      Hello,

      I have a very large dataset with over 10 million lines of data.

      The data format is shown in screenshot below
      8b878ff5-1f06-4640-9455-ac1159101107-image.png

      I have more than 400,000 set of these data line by line in this text dataset file. The first 4 lines in the example are like headers and the data I want to clean up starts with 1461 302; 1826 302, etc. In the second column, values start with 302, then change to 287.13, then return back to 302. Different small dataset have different values. But the pattern is the same.

      I want to remove the lines where the second column value are the same before it changes to a different value. And I want to keep the last row of data before second column value changes.

      So my desired data format will be as below:
      dccfe49b-f899-4615-a4dd-165973fadd2c-image.png

      I successfully did this operation last year using Notepad++ but forgot how I did it. What I remember was I used the Search/Replace function and Line Operation and Bookmark function to first select all the rows I want to delete, then delete them.

      Please help me with this issue. Very much appreciated!

      CoisesC 1 Reply Last reply Reply Quote 0
      • CoisesC
        Coises @Jizhao Li
        last edited by Coises

        @Jizhao-Li

        I’m a bit confused. In the example you gave, the desired results you showed don’t match the description you gave. Were the actual desired results:

        #391393
        ! Node 990292 Mine Year2Ø33
        ! [type—Linear ; option—linear ; timeunit=d ; unitc1ass=CARDINAL; userunit=]
        748 302
        8765 287.13
        13879 302
        END
        

        If so, you can do that by selecting Search | Replace… from the menu, entering:

        Find what: ^\d+ ([\d.]+)\R(?:\d+ \1\R)*(\d+) \1$
        Replace with: \2 \1
        Wrap around checked
        Search Mode: Regular expression

        and clicking Replace All.

        This assumes that, as in your example data, the first column is all integers but the second column can have decimals.

        Jizhao LiJ 1 Reply Last reply Reply Quote 0
        • Jizhao LiJ
          Jizhao Li @Coises
          last edited by

          @Coises

          Hi, thank you so much for your quick reply.

          Sorry my description is off and hard to understand. But the desired data format in the second screenshot is correct.

          Hopefully picture below can clarify my question better:

          c4d1b8a1-ace8-464c-b535-eb46e757b224-image.png
          7b961a11-b494-4a1b-88a9-18f145e8cefe-image.png

          I want to delete the rows circled in red and keep the rest.

          Thank you!!!

          CoisesC 1 Reply Last reply Reply Quote 0
          • CoisesC
            Coises @Jizhao Li
            last edited by

            @Jizhao-Li

            I feel like I might still be missing something.

            Are you saying that you only want to delete the lines with duplicate second values in the first group that follows a line with only the letters GAP in it?

            If that’s the case, then try changing my suggestion to:

            Find what: ^GAP\R\d+ ([\d.]+)\R(?:\d+ \1\R)*(\d+) \1$
            Replace with: GAP\r\n\2 \1

            I’m assuming you want to replace this more than once (or you’d just do it by hand) — either it occurs multiple times in one file, or one time in multiple files, or multiple times in multiple files — but I’m having trouble grasping what differentiates the specific group of lines you want to remove from the other groups of lines that follow the same pattern.

            Jizhao LiJ 1 Reply Last reply Reply Quote 1
            • guy038G
              guy038
              last edited by

              Hello, @jizhao-li, @coises and All,

              If I fully understand what you expect to, this means that :

              • After a GAP indication, you want to replace all lines of text, before a first line with a decimal number, by the last line with integer numbers only

              • And the text, from the first line with a decimal number, till the next END indication, is not changed at all


              If so, the regex S/R, below, should do the work, nicely :

              SEARCH (?-s)^(GAP\R)(\d+ \d+\R)+(?=.+\.)

              REPLACE $1$2


              Note that this regex assumes that no blank character exists at the end of all the lines, before the first line with a decimal number !

              Else, just use the Edit > Blank Operations > Trim Trailing Space to suppress these extra blank characters at end of lines


              As you can see, only the last occurrence of the group \d+ \d+\R is re-used, in the replacement phase, with the $2 syntax

              Best regards

              guy038

              1 Reply Last reply Reply Quote 0
              • Mark OlsonM
                Mark Olson
                last edited by

                @Jizhao-Li

                If you are working with single files that have 10 million lines of data in a tabular format, I strongly recommend that you use a scripting language like Python (specifically a library like pandas or polars) to manipulate those files, rather than using regular expressions in Notepad++. Not only is Python well-suited for automation of such tasks, a Python script with the right packages would probably perform the desired manipulations many times faster than a comparable solution in Notepad++.

                This is especially true if you are working with many such files, since the Find/Replace in Files functionality in Notepad++ is rather slow compared to many alternatives.

                To be clear, I’m not dumping on Notepad++ in general, simply saying that its regex search/replace functionality does not scale very well to extremely large files or large numbers of files. I quite like the flexibility of Notepad++ regex search/replace for modestly sized files.

                1 Reply Last reply Reply Quote 1
                • guy038G
                  guy038
                  last edited by guy038

                  Hi, @jizhao-li, @coises, @mark-olson and All,

                  Despite the legitimate warning, provided by @mark-olson, I was under the impression that my replacement would work correctly, even with a file of 10 million lines

                  • Since each character, less than \x{80}, is coded with one byte only

                  • Since the average length of each line is around 9.9 characters, only

                  Thus, I created such a file containing 897,000 lines for 10,668,000 bytes and… guess what ?

                  => Even with my out of date XP laptop, it took about 13.6 s only, to modify the 15,000 occurrences of the Test_Jizhao.txt file !

                  After replacement, this file now contains 561,000 lines for 7,308,000 bytes. And, of course, with a regex, I verified that between a GAP line and the first line with decimal number, only one line with two integer numbers exists

                  Best Regards,

                  guy038

                  Jizhao LiJ 1 Reply Last reply Reply Quote 0
                  • Jizhao LiJ
                    Jizhao Li @guy038
                    last edited by

                    Hi @guy038 ,

                    Thank you so much for the help.

                    Your regex worked for most of my lines. And for the lines that didn’t work, it’s not your code not working, but I didn’t describe my problem in enough detail.

                    Your understanding is mostly correct, but there are two circumstances I missed clarifying:

                    1. About the line with decimal number. The line beforewhich I want to replace all lines of text, doesn’t need to have a decimal number in the second column, but a different value than what the earlier lines (lines I want to remove) have. See examples below:
                      20e5d0c1-0a7f-4e7e-b540-e23f38d8900a-image.png
                      8d4194c8-5479-459a-b8b2-6f8e91446dde-image.png

                    2. There are some circumstances where the second column values are the same for all rows between a GAP and a END. In these cases, do not delete any lines, keep them all. See example below:
                      e7cbbe0e-b959-43e5-a0b1-a8c26a65eee0-image.png
                      3b3c2efc-49b6-4ea0-bf79-53538d16e610-image.png

                    Thank you so much for your help. Your regex definitely helps. If you could slightly adjust it to account for the two circumstances I mentioned above that would be great and much appreciated!

                    1 Reply Last reply Reply Quote 0
                    • Jizhao LiJ
                      Jizhao Li @Coises
                      last edited by

                      Hi @Coises,

                      Thank you so much for your help. I just tried your regex codes and it worked almost 100% for my problem.

                      There is one circumstance where it didn’t give me the format I wanted, which is shown in the screenshot below:
                      In this case there is no duplicate second values in this group of data and I actually want to keep all the rows as they are and don’t delete any rows. But using your regex code all the rows except for the last row were deleted.

                      53109755-55c6-4373-981c-2e81e8ee29b1-image.png

                      Basically your code works perfectly when there IS a duplicate values in the second column. But when there is NO duplicate values in the second column, I want to keep the original data as they are.

                      I hope this doesn’t add more confusion. Could you please adjust the code and send it to me? Thank you so much!

                      CoisesC 1 Reply Last reply Reply Quote 0
                      • CoisesC
                        Coises @Jizhao Li
                        last edited by

                        @Jizhao-Li said in Remove data lines with same value in certain column:

                        I think I understand. You want to delete all but the last line from the first set of lines with the same second number that follows “GAP,” but only if there is at least one additional set of lines following.

                        Try changing the find expression to:
                        ^GAP\R\d+ ([\d.]+)\R(?:\d+ \1\R)*(\d+) \1(?=\R\d+ \d+$)(?!\R\d+ \1$)

                        That’s the same test, except it checks to see that it is followed by a line that has two numbers, but not a line with has two numbers with the same second number.

                        1 Reply Last reply Reply Quote 0
                        • guy038G
                          guy038
                          last edited by guy038

                          Hi, @jizhao-li, @coises, @mark-olson and All,

                          Ah… OK. Thanks, @jizhao-li, for your additional pieces of information :-)


                          Then, from this INPUT text, below :

                          GAP
                          1461 311
                          1826 311
                          2191 311
                          2556 311
                          2921 311
                          3287 311
                          3652 311
                          4017 311
                          4382 311
                          4748 311
                          5113 311
                          5478 311
                          5843 311
                          6209 311
                          6574 311
                          6939 311
                          7304 311
                          7670 311
                          8035 311
                          8400 311
                          8765 311
                          9131 311
                          9496 311
                          9861 311
                          10026 311
                          10592 311
                          10957 311
                          11322 311
                          11687 311
                          12053 311
                          12418 311
                          12783 308
                          13148 308
                          13514 308
                          13789 308
                          END
                          
                          GAP
                          1461 303
                          1826 303
                          2191 303
                          2556 303
                          2921 303
                          3287 303
                          3652 303
                          4017 303
                          4382 303
                          4748 303
                          5113 303
                          5478 303
                          5843 303
                          6209 303
                          6574 303
                          6939 303
                          7304 303
                          7670 303
                          8035 303
                          8400 303
                          8765 303
                          9131 303
                          9496 303
                          9861 303
                          10026 303
                          10592 303
                          10957 303
                          11322 303
                          11687 303
                          12053 303
                          12418 303
                          12783 303
                          13148 303
                          13514 303
                          13789 303
                          END
                          
                          GAP
                          1461 327.5
                          1826 327.5
                          2191 327.5
                          2556 327.5
                          2921 327.5
                          3287 327.5
                          3652 327.5
                          4017 327.5
                          4382 327.5
                          4748 327.5
                          5113 327.5
                          5478 327.5
                          5843 327.5
                          6209 327.5
                          6574 327.5
                          6939 327.5
                          7304 327.5
                          7670 327.5
                          8035 327.5
                          8400 327.5
                          8765 327.5
                          9131 327.5
                          9496 327.5
                          9861 327.5
                          10026 327.5
                          10592 327.5
                          10957 327.5
                          11322 327.5
                          11687 327.5
                          12053 327.5
                          12418 327.5
                          12783 327.5
                          13148 327.5
                          13514 327.5
                          13789 327.5
                          END
                          
                          GAP
                          1461 302
                          1826 302
                          2191 299
                          2556 299
                          2921 299
                          3287 299
                          3652 299
                          4017 299
                          4382 299
                          4748 299
                          5113 299
                          5478 299
                          5843 299
                          6209 302
                          6574 302
                          6939 302
                          7304 302
                          7670 302
                          8035 302
                          8400 302
                          8765 302
                          9131 302
                          9496 302
                          9861 302
                          10026 302
                          10592 302
                          10957 302
                          11322 302
                          11687 302
                          12053 302
                          12418 302
                          12783 302
                          13148 302
                          13514 302
                          13789 302
                          END
                          

                          The following regex S/R should do the work nicely :

                          SEARCH (?-is)^(GAP\R)(\d+ (\d+(?:\.\d+)?)\R)+?(?=\d+ (?!\3$))

                          REPLACE \1\2

                          And here is your expected OUTPUT text :

                          GAP
                          12418 311
                          12783 308
                          13148 308
                          13514 308
                          13789 308
                          END
                          
                          GAP
                          1461 303
                          1826 303
                          2191 303
                          2556 303
                          2921 303
                          3287 303
                          3652 303
                          4017 303
                          4382 303
                          4748 303
                          5113 303
                          5478 303
                          5843 303
                          6209 303
                          6574 303
                          6939 303
                          7304 303
                          7670 303
                          8035 303
                          8400 303
                          8765 303
                          9131 303
                          9496 303
                          9861 303
                          10026 303
                          10592 303
                          10957 303
                          11322 303
                          11687 303
                          12053 303
                          12418 303
                          12783 303
                          13148 303
                          13514 303
                          13789 303
                          END
                          
                          GAP
                          1461 327.5
                          1826 327.5
                          2191 327.5
                          2556 327.5
                          2921 327.5
                          3287 327.5
                          3652 327.5
                          4017 327.5
                          4382 327.5
                          4748 327.5
                          5113 327.5
                          5478 327.5
                          5843 327.5
                          6209 327.5
                          6574 327.5
                          6939 327.5
                          7304 327.5
                          7670 327.5
                          8035 327.5
                          8400 327.5
                          8765 327.5
                          9131 327.5
                          9496 327.5
                          9861 327.5
                          10026 327.5
                          10592 327.5
                          10957 327.5
                          11322 327.5
                          11687 327.5
                          12053 327.5
                          12418 327.5
                          12783 327.5
                          13148 327.5
                          13514 327.5
                          13789 327.5
                          END
                          
                          GAP
                          1826 302
                          2191 299
                          2556 299
                          2921 299
                          3287 299
                          3652 299
                          4017 299
                          4382 299
                          4748 299
                          5113 299
                          5478 299
                          5843 299
                          6209 302
                          6574 302
                          6939 302
                          7304 302
                          7670 302
                          8035 302
                          8400 302
                          8765 302
                          9131 302
                          9496 302
                          9861 302
                          10026 302
                          10592 302
                          10957 302
                          11322 302
                          11687 302
                          12053 302
                          12418 302
                          12783 302
                          13148 302
                          13514 302
                          13789 302
                          END
                          

                          Some other tests, all OK ! On the left the INPUT text and on the right the OUTPUT text :

                          GAP
                          4017 123
                          4382 123               GAP
                          4748 123               5478 123
                          5113 123               5843 456
                          5478 123               6209 456
                          5843 456         =>    6574 456
                          6209 456               6939 456
                          6574 456               7304 456
                          6939 456               END
                          7304 456
                          END
                          
                          GAP
                          4017 123
                          4382 123               GAP
                          4748 123               5478 123
                          5113 123               5843 456.789
                          5478 123               6209 456.789
                          5843 456.789     =>    6574 456.789
                          6209 456.789           6939 456.789
                          6574 456.789           7304 456.789
                          6939 456.789           END
                          7304 456.789
                          END
                          
                          GAP
                          4017 123
                          4382 123               GAP
                          4748 123               5478 123
                          5113 123               5843 123.789
                          5478 123               6209 123.789
                          5843 123.789     =>    6574 123.789
                          6209 123.789           6939 123.789
                          6574 123.789           7304 123.789
                          6939 123.789           END
                          7304 123.789
                          END
                          
                          GAP
                          4017 123.012
                          4382 123.012           GAP
                          4748 123.012           5478 123.012
                          5113 123.012           5843 456
                          5478 123.012     =>    6209 456
                          5843 456               6574 456
                          6209 456               6939 456
                          6574 456               7304 456
                          6939 456               END
                          7304 456
                          END
                          
                          GAP
                          4017 123.012
                          4382 123.012           GAP
                          4748 123.012           5478 123.012
                          5113 123.012           5843 123
                          5478 123.012     =>    6209 123
                          5843 123               6574 123
                          6209 123               6939 123
                          6574 123               7304 123
                          6939 123               END
                          7304 123
                          END
                          
                          GAP
                          4017 123.012
                          4382 123.012           GAP
                          4748 123.012           5478 123.012
                          5113 123.012           5843 456.789
                          5478 123.012           6209 456.789
                          5843 456.789     =>    6574 456.789
                          6209 456.789           6939 456.789
                          6574 456.789           7304 456.789
                          6939 456.789           END
                          7304 456.789
                          END
                          
                          GAP
                          4017 123               GAP
                          4382 123               4748 123
                          4748 123               5113 456
                          5113 456               5478 456
                          5478 456               5843 456
                          5843 456         =>    6209 456
                          6209 456               6574 123
                          6574 123               6939 123
                          6939 123               7304 123
                          7304 123               END
                          END
                          
                          GAP
                          4017 123.012           GAP
                          4382 123.012           4748 123.012
                          4748 123.012           5113 456.789
                          5113 456.789           5478 456.789
                          5478 456.789           5843 456.789
                          5843 456.789     =>    6209 456.789
                          6209 456.789           6574 123.012
                          6574 123.012           6939 123.012
                          6939 123.012           7304 123.012
                          7304 123.012           END
                          END
                          

                          Note that your file will be unchanged if you redo the search-replacement !!

                          Best Regards,

                          guy038

                          1 Reply Last reply Reply Quote 0
                          • First post
                            Last post
                          The Community of users of the Notepad++ text editor.
                          Powered by NodeBB | Contributors