Remove data lines with same value in certain column

Jizhao Li

Hello,

I have a very large dataset with over 10 million lines of data.

The data format is shown in screenshot below

I have more than 400,000 set of these data line by line in this text dataset file. The first 4 lines in the example are like headers and the data I want to clean up starts with 1461 302; 1826 302, etc. In the second column, values start with 302, then change to 287.13, then return back to 302. Different small dataset have different values. But the pattern is the same.

I want to remove the lines where the second column value are the same before it changes to a different value. And I want to keep the last row of data before second column value changes.

So my desired data format will be as below:

I successfully did this operation last year using Notepad++ but forgot how I did it. What I remember was I used the Search/Replace function and Line Operation and Bookmark function to first select all the rows I want to delete, then delete them.

Please help me with this issue. Very much appreciated!

Coises

@Jizhao-Li

I’m a bit confused. In the example you gave, the desired results you showed don’t match the description you gave. Were the actual desired results:

#391393
! Node 990292 Mine Year2Ø33
! [type—Linear ; option—linear ; timeunit=d ; unitc1ass=CARDINAL; userunit=]
748 302
8765 287.13
13879 302
END

If so, you can do that by selecting Search | Replace… from the menu, entering:

Find what: ^\d+ ([\d.]+)\R(?:\d+ \1\R)*(\d+) \1$
Replace with: \2 \1
Wrap around checked
Search Mode: Regular expression

and clicking Replace All.

This assumes that, as in your example data, the first column is all integers but the second column can have decimals.

Jizhao Li

@Coises

Hi, thank you so much for your quick reply.

Sorry my description is off and hard to understand. But the desired data format in the second screenshot is correct.

Hopefully picture below can clarify my question better:

I want to delete the rows circled in red and keep the rest.

Thank you!!!

Coises

@Jizhao-Li

I feel like I might still be missing something.

Are you saying that you only want to delete the lines with duplicate second values in the first group that follows a line with only the letters GAP in it?

If that’s the case, then try changing my suggestion to:

Find what: ^GAP\R\d+ ([\d.]+)\R(?:\d+ \1\R)*(\d+) \1$
Replace with: GAP\r\n\2 \1

I’m assuming you want to replace this more than once (or you’d just do it by hand) — either it occurs multiple times in one file, or one time in multiple files, or multiple times in multiple files — but I’m having trouble grasping what differentiates the specific group of lines you want to remove from the other groups of lines that follow the same pattern.

guy038

Hello, @jizhao-li, @coises and All,

If I fully understand what you expect to, this means that :

After a GAP indication, you want to replace all lines of text, before a first line with a decimal number, by the last line with integer numbers only
And the text, from the first line with a decimal number, till the next END indication, is not changed at all

If so, the regex S/R, below, should do the work, nicely :

SEARCH (?-s)^(GAP\R)(\d+ \d+\R)+(?=.+\.)

REPLACE $1$2

Note that this regex assumes that no blank character exists at the end of all the lines, before the first line with a decimal number !

Else, just use the Edit > Blank Operations > Trim Trailing Space to suppress these extra blank characters at end of lines

As you can see, only the last occurrence of the group \d+ \d+\R is re-used, in the replacement phase, with the $2 syntax

Best regards

guy038

Mark Olson

@Jizhao-Li

If you are working with single files that have 10 million lines of data in a tabular format, I strongly recommend that you use a scripting language like Python (specifically a library like pandas or polars) to manipulate those files, rather than using regular expressions in Notepad++. Not only is Python well-suited for automation of such tasks, a Python script with the right packages would probably perform the desired manipulations many times faster than a comparable solution in Notepad++.

This is especially true if you are working with many such files, since the Find/Replace in Files functionality in Notepad++ is rather slow compared to many alternatives.

To be clear, I’m not dumping on Notepad++ in general, simply saying that its regex search/replace functionality does not scale very well to extremely large files or large numbers of files. I quite like the flexibility of Notepad++ regex search/replace for modestly sized files.

guy038

Hi, @jizhao-li, @coises, @mark-olson and All,

Despite the legitimate warning, provided by @mark-olson, I was under the impression that my replacement would work correctly, even with a file of 10 million lines

Since each character, less than \x{80}, is coded with one byte only
Since the average length of each line is around 9.9 characters, only

Thus, I created such a file containing 897,000 lines for 10,668,000 bytes and… guess what ?

=> Even with my out of date XP laptop, it took about 13.6 s only, to modify the 15,000 occurrences of the Test_Jizhao.txt file !

After replacement, this file now contains 561,000 lines for 7,308,000 bytes. And, of course, with a regex, I verified that between a GAP line and the first line with decimal number, only one line with two integer numbers exists

Best Regards,

guy038

Jizhao Li

Hi @guy038 ,

Thank you so much for the help.

Your regex worked for most of my lines. And for the lines that didn’t work, it’s not your code not working, but I didn’t describe my problem in enough detail.

Your understanding is mostly correct, but there are two circumstances I missed clarifying:

About the line with decimal number. The line beforewhich I want to replace all lines of text, doesn’t need to have a decimal number in the second column, but a different value than what the earlier lines (lines I want to remove) have. See examples below:
There are some circumstances where the second column values are the same for all rows between a GAP and a END. In these cases, do not delete any lines, keep them all. See example below:

Thank you so much for your help. Your regex definitely helps. If you could slightly adjust it to account for the two circumstances I mentioned above that would be great and much appreciated!

Jizhao Li

Hi @Coises,

Thank you so much for your help. I just tried your regex codes and it worked almost 100% for my problem.

There is one circumstance where it didn’t give me the format I wanted, which is shown in the screenshot below:
In this case there is no duplicate second values in this group of data and I actually want to keep all the rows as they are and don’t delete any rows. But using your regex code all the rows except for the last row were deleted.

Basically your code works perfectly when there IS a duplicate values in the second column. But when there is NO duplicate values in the second column, I want to keep the original data as they are.

I hope this doesn’t add more confusion. Could you please adjust the code and send it to me? Thank you so much!

Coises

@Jizhao-Li said in Remove data lines with same value in certain column:

I think I understand. You want to delete all but the last line from the first set of lines with the same second number that follows “GAP,” but only if there is at least one additional set of lines following.

Try changing the find expression to:
^GAP\R\d+ ([\d.]+)\R(?:\d+ \1\R)*(\d+) \1(?=\R\d+ \d+$)(?!\R\d+ \1$)

That’s the same test, except it checks to see that it is followed by a line that has two numbers, but not a line with has two numbers with the same second number.

guy038

Hi, @jizhao-li, @coises, @mark-olson and All,

Ah… OK. Thanks, @jizhao-li, for your additional pieces of information :-)

Then, from this INPUT text, below :

GAP
1461 311
1826 311
2191 311
2556 311
2921 311
3287 311
3652 311
4017 311
4382 311
4748 311
5113 311
5478 311
5843 311
6209 311
6574 311
6939 311
7304 311
7670 311
8035 311
8400 311
8765 311
9131 311
9496 311
9861 311
10026 311
10592 311
10957 311
11322 311
11687 311
12053 311
12418 311
12783 308
13148 308
13514 308
13789 308
END

GAP
1461 303
1826 303
2191 303
2556 303
2921 303
3287 303
3652 303
4017 303
4382 303
4748 303
5113 303
5478 303
5843 303
6209 303
6574 303
6939 303
7304 303
7670 303
8035 303
8400 303
8765 303
9131 303
9496 303
9861 303
10026 303
10592 303
10957 303
11322 303
11687 303
12053 303
12418 303
12783 303
13148 303
13514 303
13789 303
END

GAP
1461 327.5
1826 327.5
2191 327.5
2556 327.5
2921 327.5
3287 327.5
3652 327.5
4017 327.5
4382 327.5
4748 327.5
5113 327.5
5478 327.5
5843 327.5
6209 327.5
6574 327.5
6939 327.5
7304 327.5
7670 327.5
8035 327.5
8400 327.5
8765 327.5
9131 327.5
9496 327.5
9861 327.5
10026 327.5
10592 327.5
10957 327.5
11322 327.5
11687 327.5
12053 327.5
12418 327.5
12783 327.5
13148 327.5
13514 327.5
13789 327.5
END

GAP
1461 302
1826 302
2191 299
2556 299
2921 299
3287 299
3652 299
4017 299
4382 299
4748 299
5113 299
5478 299
5843 299
6209 302
6574 302
6939 302
7304 302
7670 302
8035 302
8400 302
8765 302
9131 302
9496 302
9861 302
10026 302
10592 302
10957 302
11322 302
11687 302
12053 302
12418 302
12783 302
13148 302
13514 302
13789 302
END

The following regex S/R should do the work nicely :

SEARCH (?-is)^(GAP\R)(\d+ (\d+(?:\.\d+)?)\R)+?(?=\d+ (?!\3$))

REPLACE \1\2

And here is your expected OUTPUT text :

GAP
12418 311
12783 308
13148 308
13514 308
13789 308
END

GAP
1461 303
1826 303
2191 303
2556 303
2921 303
3287 303
3652 303
4017 303
4382 303
4748 303
5113 303
5478 303
5843 303
6209 303
6574 303
6939 303
7304 303
7670 303
8035 303
8400 303
8765 303
9131 303
9496 303
9861 303
10026 303
10592 303
10957 303
11322 303
11687 303
12053 303
12418 303
12783 303
13148 303
13514 303
13789 303
END

GAP
1461 327.5
1826 327.5
2191 327.5
2556 327.5
2921 327.5
3287 327.5
3652 327.5
4017 327.5
4382 327.5
4748 327.5
5113 327.5
5478 327.5
5843 327.5
6209 327.5
6574 327.5
6939 327.5
7304 327.5
7670 327.5
8035 327.5
8400 327.5
8765 327.5
9131 327.5
9496 327.5
9861 327.5
10026 327.5
10592 327.5
10957 327.5
11322 327.5
11687 327.5
12053 327.5
12418 327.5
12783 327.5
13148 327.5
13514 327.5
13789 327.5
END

GAP
1826 302
2191 299
2556 299
2921 299
3287 299
3652 299
4017 299
4382 299
4748 299
5113 299
5478 299
5843 299
6209 302
6574 302
6939 302
7304 302
7670 302
8035 302
8400 302
8765 302
9131 302
9496 302
9861 302
10026 302
10592 302
10957 302
11322 302
11687 302
12053 302
12418 302
12783 302
13148 302
13514 302
13789 302
END

Some other tests, all OK ! On the left the INPUT text and on the right the OUTPUT text :

GAP
4017 123
4382 123               GAP
4748 123               5478 123
5113 123               5843 456
5478 123               6209 456
5843 456         =>    6574 456
6209 456               6939 456
6574 456               7304 456
6939 456               END
7304 456
END

GAP
4017 123
4382 123               GAP
4748 123               5478 123
5113 123               5843 456.789
5478 123               6209 456.789
5843 456.789     =>    6574 456.789
6209 456.789           6939 456.789
6574 456.789           7304 456.789
6939 456.789           END
7304 456.789
END

GAP
4017 123
4382 123               GAP
4748 123               5478 123
5113 123               5843 123.789
5478 123               6209 123.789
5843 123.789     =>    6574 123.789
6209 123.789           6939 123.789
6574 123.789           7304 123.789
6939 123.789           END
7304 123.789
END

GAP
4017 123.012
4382 123.012           GAP
4748 123.012           5478 123.012
5113 123.012           5843 456
5478 123.012     =>    6209 456
5843 456               6574 456
6209 456               6939 456
6574 456               7304 456
6939 456               END
7304 456
END

GAP
4017 123.012
4382 123.012           GAP
4748 123.012           5478 123.012
5113 123.012           5843 123
5478 123.012     =>    6209 123
5843 123               6574 123
6209 123               6939 123
6574 123               7304 123
6939 123               END
7304 123
END

GAP
4017 123.012
4382 123.012           GAP
4748 123.012           5478 123.012
5113 123.012           5843 456.789
5478 123.012           6209 456.789
5843 456.789     =>    6574 456.789
6209 456.789           6939 456.789
6574 456.789           7304 456.789
6939 456.789           END
7304 456.789
END

GAP
4017 123               GAP
4382 123               4748 123
4748 123               5113 456
5113 456               5478 456
5478 456               5843 456
5843 456         =>    6209 456
6209 456               6574 123
6574 123               6939 123
6939 123               7304 123
7304 123               END
END

GAP
4017 123.012           GAP
4382 123.012           4748 123.012
4748 123.012           5113 456.789
5113 456.789           5478 456.789
5478 456.789           5843 456.789
5843 456.789     =>    6209 456.789
6209 456.789           6574 123.012
6574 123.012           6939 123.012
6939 123.012           7304 123.012
7304 123.012           END
END

Note that your file will be unchanged if you redo the search-replacement !!

Best Regards,

guy038