Remove data lines with same value in certain column
-
Hello,
I have a very large dataset with over 10 million lines of data.
The data format is shown in screenshot below
I have more than 400,000 set of these data line by line in this text dataset file. The first 4 lines in the example are like headers and the data I want to clean up starts with 1461 302; 1826 302, etc. In the second column, values start with 302, then change to 287.13, then return back to 302. Different small dataset have different values. But the pattern is the same.
I want to remove the lines where the second column value are the same before it changes to a different value. And I want to keep the last row of data before second column value changes.
So my desired data format will be as below:
I successfully did this operation last year using Notepad++ but forgot how I did it. What I remember was I used the Search/Replace function and Line Operation and Bookmark function to first select all the rows I want to delete, then delete them.
Please help me with this issue. Very much appreciated!
-
I’m a bit confused. In the example you gave, the desired results you showed don’t match the description you gave. Were the actual desired results:
#391393 ! Node 990292 Mine Year2Ø33 ! [type—Linear ; option—linear ; timeunit=d ; unitc1ass=CARDINAL; userunit=] 748 302 8765 287.13 13879 302 END
If so, you can do that by selecting Search | Replace… from the menu, entering:
Find what:
^\d+ ([\d.]+)\R(?:\d+ \1\R)*(\d+) \1$
Replace with:\2 \1
Wrap around checked
Search Mode: Regular expressionand clicking Replace All.
This assumes that, as in your example data, the first column is all integers but the second column can have decimals.
-
Hi, thank you so much for your quick reply.
Sorry my description is off and hard to understand. But the desired data format in the second screenshot is correct.
Hopefully picture below can clarify my question better:
I want to delete the rows circled in red and keep the rest.
Thank you!!!
-
I feel like I might still be missing something.
Are you saying that you only want to delete the lines with duplicate second values in the first group that follows a line with only the letters GAP in it?
If that’s the case, then try changing my suggestion to:
Find what:
^GAP\R\d+ ([\d.]+)\R(?:\d+ \1\R)*(\d+) \1$
Replace with:GAP\r\n\2 \1
I’m assuming you want to replace this more than once (or you’d just do it by hand) — either it occurs multiple times in one file, or one time in multiple files, or multiple times in multiple files — but I’m having trouble grasping what differentiates the specific group of lines you want to remove from the other groups of lines that follow the same pattern.
-
Hello, @jizhao-li, @coises and All,
If I fully understand what you expect to, this means that :
-
After a GAP indication, you want to replace all lines of text, before a first line with a decimal number, by the last line with integer numbers only
-
And the text, from the first line with a decimal number, till the next END indication, is not changed at all
If so, the regex S/R, below, should do the work, nicely :
SEARCH
(?-s)^(GAP\R)(\d+ \d+\R)+(?=.+\.)
REPLACE
$1$2
Note that this regex assumes that no blank character exists at the end of all the lines, before the first line with a decimal number !
Else, just use the
Edit > Blank Operations > Trim Trailing Space
to suppress these extra blank characters at end of lines
As you can see, only the last occurrence of the group
\d+ \d+\R
is re-used, in the replacement phase, with the$2
syntaxBest regards
guy038
-
-
If you are working with single files that have 10 million lines of data in a tabular format, I strongly recommend that you use a scripting language like Python (specifically a library like pandas or polars) to manipulate those files, rather than using regular expressions in Notepad++. Not only is Python well-suited for automation of such tasks, a Python script with the right packages would probably perform the desired manipulations many times faster than a comparable solution in Notepad++.
This is especially true if you are working with many such files, since the Find/Replace in Files functionality in Notepad++ is rather slow compared to many alternatives.
To be clear, I’m not dumping on Notepad++ in general, simply saying that its regex search/replace functionality does not scale very well to extremely large files or large numbers of files. I quite like the flexibility of Notepad++ regex search/replace for modestly sized files.
-
Hi, @jizhao-li, @coises, @mark-olson and All,
Despite the legitimate warning, provided by @mark-olson, I was under the impression that my replacement would work correctly, even with a file of
10 million
lines-
Since each character, less than
\x{80}
, is coded with one byte only -
Since the average length of each line is around
9.9
characters, only
Thus, I created such a file containing
897,000
lines for10,668,000
bytes and… guess what ?=> Even with my out of date
XP
laptop, it took about13.6 s
only, to modify the15,000
occurrences of theTest_Jizhao.txt
file !After replacement, this file now contains
561,000
lines for7,308,000
bytes. And, of course, with a regex, I verified that between aGAP
line and the first line with decimal number, only one line with two integer numbers existsBest Regards,
guy038
-