Community
    • Login

    Remove data lines with same value in certain column

    Scheduled Pinned Locked Moved Help wanted · · · – – – · · ·
    11 Posts 4 Posters 403 Views
    Loading More Posts
    • Oldest to Newest
    • Newest to Oldest
    • Most Votes
    Reply
    • Reply as topic
    Log in to reply
    This topic has been deleted. Only users with topic management privileges can see it.
    • CoisesC
      Coises @Jizhao Li
      last edited by Coises

      @Jizhao-Li

      I’m a bit confused. In the example you gave, the desired results you showed don’t match the description you gave. Were the actual desired results:

      #391393
      ! Node 990292 Mine Year2Ø33
      ! [type—Linear ; option—linear ; timeunit=d ; unitc1ass=CARDINAL; userunit=]
      748 302
      8765 287.13
      13879 302
      END
      

      If so, you can do that by selecting Search | Replace… from the menu, entering:

      Find what: ^\d+ ([\d.]+)\R(?:\d+ \1\R)*(\d+) \1$
      Replace with: \2 \1
      Wrap around checked
      Search Mode: Regular expression

      and clicking Replace All.

      This assumes that, as in your example data, the first column is all integers but the second column can have decimals.

      Jizhao LiJ 1 Reply Last reply Reply Quote 0
      • Jizhao LiJ
        Jizhao Li @Coises
        last edited by

        @Coises

        Hi, thank you so much for your quick reply.

        Sorry my description is off and hard to understand. But the desired data format in the second screenshot is correct.

        Hopefully picture below can clarify my question better:

        c4d1b8a1-ace8-464c-b535-eb46e757b224-image.png
        7b961a11-b494-4a1b-88a9-18f145e8cefe-image.png

        I want to delete the rows circled in red and keep the rest.

        Thank you!!!

        CoisesC 1 Reply Last reply Reply Quote 0
        • CoisesC
          Coises @Jizhao Li
          last edited by

          @Jizhao-Li

          I feel like I might still be missing something.

          Are you saying that you only want to delete the lines with duplicate second values in the first group that follows a line with only the letters GAP in it?

          If that’s the case, then try changing my suggestion to:

          Find what: ^GAP\R\d+ ([\d.]+)\R(?:\d+ \1\R)*(\d+) \1$
          Replace with: GAP\r\n\2 \1

          I’m assuming you want to replace this more than once (or you’d just do it by hand) — either it occurs multiple times in one file, or one time in multiple files, or multiple times in multiple files — but I’m having trouble grasping what differentiates the specific group of lines you want to remove from the other groups of lines that follow the same pattern.

          Jizhao LiJ 1 Reply Last reply Reply Quote 1
          • guy038G
            guy038
            last edited by

            Hello, @jizhao-li, @coises and All,

            If I fully understand what you expect to, this means that :

            • After a GAP indication, you want to replace all lines of text, before a first line with a decimal number, by the last line with integer numbers only

            • And the text, from the first line with a decimal number, till the next END indication, is not changed at all


            If so, the regex S/R, below, should do the work, nicely :

            SEARCH (?-s)^(GAP\R)(\d+ \d+\R)+(?=.+\.)

            REPLACE $1$2


            Note that this regex assumes that no blank character exists at the end of all the lines, before the first line with a decimal number !

            Else, just use the Edit > Blank Operations > Trim Trailing Space to suppress these extra blank characters at end of lines


            As you can see, only the last occurrence of the group \d+ \d+\R is re-used, in the replacement phase, with the $2 syntax

            Best regards

            guy038

            1 Reply Last reply Reply Quote 0
            • Mark OlsonM
              Mark Olson
              last edited by

              @Jizhao-Li

              If you are working with single files that have 10 million lines of data in a tabular format, I strongly recommend that you use a scripting language like Python (specifically a library like pandas or polars) to manipulate those files, rather than using regular expressions in Notepad++. Not only is Python well-suited for automation of such tasks, a Python script with the right packages would probably perform the desired manipulations many times faster than a comparable solution in Notepad++.

              This is especially true if you are working with many such files, since the Find/Replace in Files functionality in Notepad++ is rather slow compared to many alternatives.

              To be clear, I’m not dumping on Notepad++ in general, simply saying that its regex search/replace functionality does not scale very well to extremely large files or large numbers of files. I quite like the flexibility of Notepad++ regex search/replace for modestly sized files.

              1 Reply Last reply Reply Quote 1
              • guy038G
                guy038
                last edited by guy038

                Hi, @jizhao-li, @coises, @mark-olson and All,

                Despite the legitimate warning, provided by @mark-olson, I was under the impression that my replacement would work correctly, even with a file of 10 million lines

                • Since each character, less than \x{80}, is coded with one byte only

                • Since the average length of each line is around 9.9 characters, only

                Thus, I created such a file containing 897,000 lines for 10,668,000 bytes and… guess what ?

                => Even with my out of date XP laptop, it took about 13.6 s only, to modify the 15,000 occurrences of the Test_Jizhao.txt file !

                After replacement, this file now contains 561,000 lines for 7,308,000 bytes. And, of course, with a regex, I verified that between a GAP line and the first line with decimal number, only one line with two integer numbers exists

                Best Regards,

                guy038

                Jizhao LiJ 1 Reply Last reply Reply Quote 0
                • Jizhao LiJ
                  Jizhao Li @guy038
                  last edited by

                  Hi @guy038 ,

                  Thank you so much for the help.

                  Your regex worked for most of my lines. And for the lines that didn’t work, it’s not your code not working, but I didn’t describe my problem in enough detail.

                  Your understanding is mostly correct, but there are two circumstances I missed clarifying:

                  1. About the line with decimal number. The line beforewhich I want to replace all lines of text, doesn’t need to have a decimal number in the second column, but a different value than what the earlier lines (lines I want to remove) have. See examples below:
                    20e5d0c1-0a7f-4e7e-b540-e23f38d8900a-image.png
                    8d4194c8-5479-459a-b8b2-6f8e91446dde-image.png

                  2. There are some circumstances where the second column values are the same for all rows between a GAP and a END. In these cases, do not delete any lines, keep them all. See example below:
                    e7cbbe0e-b959-43e5-a0b1-a8c26a65eee0-image.png
                    3b3c2efc-49b6-4ea0-bf79-53538d16e610-image.png

                  Thank you so much for your help. Your regex definitely helps. If you could slightly adjust it to account for the two circumstances I mentioned above that would be great and much appreciated!

                  1 Reply Last reply Reply Quote 0
                  • Jizhao LiJ
                    Jizhao Li @Coises
                    last edited by

                    Hi @Coises,

                    Thank you so much for your help. I just tried your regex codes and it worked almost 100% for my problem.

                    There is one circumstance where it didn’t give me the format I wanted, which is shown in the screenshot below:
                    In this case there is no duplicate second values in this group of data and I actually want to keep all the rows as they are and don’t delete any rows. But using your regex code all the rows except for the last row were deleted.

                    53109755-55c6-4373-981c-2e81e8ee29b1-image.png

                    Basically your code works perfectly when there IS a duplicate values in the second column. But when there is NO duplicate values in the second column, I want to keep the original data as they are.

                    I hope this doesn’t add more confusion. Could you please adjust the code and send it to me? Thank you so much!

                    CoisesC 1 Reply Last reply Reply Quote 0
                    • CoisesC
                      Coises @Jizhao Li
                      last edited by

                      @Jizhao-Li said in Remove data lines with same value in certain column:

                      I think I understand. You want to delete all but the last line from the first set of lines with the same second number that follows “GAP,” but only if there is at least one additional set of lines following.

                      Try changing the find expression to:
                      ^GAP\R\d+ ([\d.]+)\R(?:\d+ \1\R)*(\d+) \1(?=\R\d+ \d+$)(?!\R\d+ \1$)

                      That’s the same test, except it checks to see that it is followed by a line that has two numbers, but not a line with has two numbers with the same second number.

                      1 Reply Last reply Reply Quote 0
                      • guy038G
                        guy038
                        last edited by guy038

                        Hi, @jizhao-li, @coises, @mark-olson and All,

                        Ah… OK. Thanks, @jizhao-li, for your additional pieces of information :-)


                        Then, from this INPUT text, below :

                        GAP
                        1461 311
                        1826 311
                        2191 311
                        2556 311
                        2921 311
                        3287 311
                        3652 311
                        4017 311
                        4382 311
                        4748 311
                        5113 311
                        5478 311
                        5843 311
                        6209 311
                        6574 311
                        6939 311
                        7304 311
                        7670 311
                        8035 311
                        8400 311
                        8765 311
                        9131 311
                        9496 311
                        9861 311
                        10026 311
                        10592 311
                        10957 311
                        11322 311
                        11687 311
                        12053 311
                        12418 311
                        12783 308
                        13148 308
                        13514 308
                        13789 308
                        END
                        
                        GAP
                        1461 303
                        1826 303
                        2191 303
                        2556 303
                        2921 303
                        3287 303
                        3652 303
                        4017 303
                        4382 303
                        4748 303
                        5113 303
                        5478 303
                        5843 303
                        6209 303
                        6574 303
                        6939 303
                        7304 303
                        7670 303
                        8035 303
                        8400 303
                        8765 303
                        9131 303
                        9496 303
                        9861 303
                        10026 303
                        10592 303
                        10957 303
                        11322 303
                        11687 303
                        12053 303
                        12418 303
                        12783 303
                        13148 303
                        13514 303
                        13789 303
                        END
                        
                        GAP
                        1461 327.5
                        1826 327.5
                        2191 327.5
                        2556 327.5
                        2921 327.5
                        3287 327.5
                        3652 327.5
                        4017 327.5
                        4382 327.5
                        4748 327.5
                        5113 327.5
                        5478 327.5
                        5843 327.5
                        6209 327.5
                        6574 327.5
                        6939 327.5
                        7304 327.5
                        7670 327.5
                        8035 327.5
                        8400 327.5
                        8765 327.5
                        9131 327.5
                        9496 327.5
                        9861 327.5
                        10026 327.5
                        10592 327.5
                        10957 327.5
                        11322 327.5
                        11687 327.5
                        12053 327.5
                        12418 327.5
                        12783 327.5
                        13148 327.5
                        13514 327.5
                        13789 327.5
                        END
                        
                        GAP
                        1461 302
                        1826 302
                        2191 299
                        2556 299
                        2921 299
                        3287 299
                        3652 299
                        4017 299
                        4382 299
                        4748 299
                        5113 299
                        5478 299
                        5843 299
                        6209 302
                        6574 302
                        6939 302
                        7304 302
                        7670 302
                        8035 302
                        8400 302
                        8765 302
                        9131 302
                        9496 302
                        9861 302
                        10026 302
                        10592 302
                        10957 302
                        11322 302
                        11687 302
                        12053 302
                        12418 302
                        12783 302
                        13148 302
                        13514 302
                        13789 302
                        END
                        

                        The following regex S/R should do the work nicely :

                        SEARCH (?-is)^(GAP\R)(\d+ (\d+(?:\.\d+)?)\R)+?(?=\d+ (?!\3$))

                        REPLACE \1\2

                        And here is your expected OUTPUT text :

                        GAP
                        12418 311
                        12783 308
                        13148 308
                        13514 308
                        13789 308
                        END
                        
                        GAP
                        1461 303
                        1826 303
                        2191 303
                        2556 303
                        2921 303
                        3287 303
                        3652 303
                        4017 303
                        4382 303
                        4748 303
                        5113 303
                        5478 303
                        5843 303
                        6209 303
                        6574 303
                        6939 303
                        7304 303
                        7670 303
                        8035 303
                        8400 303
                        8765 303
                        9131 303
                        9496 303
                        9861 303
                        10026 303
                        10592 303
                        10957 303
                        11322 303
                        11687 303
                        12053 303
                        12418 303
                        12783 303
                        13148 303
                        13514 303
                        13789 303
                        END
                        
                        GAP
                        1461 327.5
                        1826 327.5
                        2191 327.5
                        2556 327.5
                        2921 327.5
                        3287 327.5
                        3652 327.5
                        4017 327.5
                        4382 327.5
                        4748 327.5
                        5113 327.5
                        5478 327.5
                        5843 327.5
                        6209 327.5
                        6574 327.5
                        6939 327.5
                        7304 327.5
                        7670 327.5
                        8035 327.5
                        8400 327.5
                        8765 327.5
                        9131 327.5
                        9496 327.5
                        9861 327.5
                        10026 327.5
                        10592 327.5
                        10957 327.5
                        11322 327.5
                        11687 327.5
                        12053 327.5
                        12418 327.5
                        12783 327.5
                        13148 327.5
                        13514 327.5
                        13789 327.5
                        END
                        
                        GAP
                        1826 302
                        2191 299
                        2556 299
                        2921 299
                        3287 299
                        3652 299
                        4017 299
                        4382 299
                        4748 299
                        5113 299
                        5478 299
                        5843 299
                        6209 302
                        6574 302
                        6939 302
                        7304 302
                        7670 302
                        8035 302
                        8400 302
                        8765 302
                        9131 302
                        9496 302
                        9861 302
                        10026 302
                        10592 302
                        10957 302
                        11322 302
                        11687 302
                        12053 302
                        12418 302
                        12783 302
                        13148 302
                        13514 302
                        13789 302
                        END
                        

                        Some other tests, all OK ! On the left the INPUT text and on the right the OUTPUT text :

                        GAP
                        4017 123
                        4382 123               GAP
                        4748 123               5478 123
                        5113 123               5843 456
                        5478 123               6209 456
                        5843 456         =>    6574 456
                        6209 456               6939 456
                        6574 456               7304 456
                        6939 456               END
                        7304 456
                        END
                        
                        GAP
                        4017 123
                        4382 123               GAP
                        4748 123               5478 123
                        5113 123               5843 456.789
                        5478 123               6209 456.789
                        5843 456.789     =>    6574 456.789
                        6209 456.789           6939 456.789
                        6574 456.789           7304 456.789
                        6939 456.789           END
                        7304 456.789
                        END
                        
                        GAP
                        4017 123
                        4382 123               GAP
                        4748 123               5478 123
                        5113 123               5843 123.789
                        5478 123               6209 123.789
                        5843 123.789     =>    6574 123.789
                        6209 123.789           6939 123.789
                        6574 123.789           7304 123.789
                        6939 123.789           END
                        7304 123.789
                        END
                        
                        GAP
                        4017 123.012
                        4382 123.012           GAP
                        4748 123.012           5478 123.012
                        5113 123.012           5843 456
                        5478 123.012     =>    6209 456
                        5843 456               6574 456
                        6209 456               6939 456
                        6574 456               7304 456
                        6939 456               END
                        7304 456
                        END
                        
                        GAP
                        4017 123.012
                        4382 123.012           GAP
                        4748 123.012           5478 123.012
                        5113 123.012           5843 123
                        5478 123.012     =>    6209 123
                        5843 123               6574 123
                        6209 123               6939 123
                        6574 123               7304 123
                        6939 123               END
                        7304 123
                        END
                        
                        GAP
                        4017 123.012
                        4382 123.012           GAP
                        4748 123.012           5478 123.012
                        5113 123.012           5843 456.789
                        5478 123.012           6209 456.789
                        5843 456.789     =>    6574 456.789
                        6209 456.789           6939 456.789
                        6574 456.789           7304 456.789
                        6939 456.789           END
                        7304 456.789
                        END
                        
                        GAP
                        4017 123               GAP
                        4382 123               4748 123
                        4748 123               5113 456
                        5113 456               5478 456
                        5478 456               5843 456
                        5843 456         =>    6209 456
                        6209 456               6574 123
                        6574 123               6939 123
                        6939 123               7304 123
                        7304 123               END
                        END
                        
                        GAP
                        4017 123.012           GAP
                        4382 123.012           4748 123.012
                        4748 123.012           5113 456.789
                        5113 456.789           5478 456.789
                        5478 456.789           5843 456.789
                        5843 456.789     =>    6209 456.789
                        6209 456.789           6574 123.012
                        6574 123.012           6939 123.012
                        6939 123.012           7304 123.012
                        7304 123.012           END
                        END
                        

                        Note that your file will be unchanged if you redo the search-replacement !!

                        Best Regards,

                        guy038

                        1 Reply Last reply Reply Quote 0
                        • First post
                          Last post
                        The Community of users of the Notepad++ text editor.
                        Powered by NodeBB | Contributors