How to extract certain columns from a big Notepad text file?

zarate petery

I have a big text file whose data is arranged in 5 columns, but I only need the third column and the hash:

 1.0000000000000000         0.0000000000 A e339e33ef20d66617d7d15418bd6526d
 1.5000000000000000         0.3010299957 C 2e0d7e0fcd9576ca19b17741c2ebc1bb
 1.7500000000000000         0.6020599913 A 8df03880b2549e805d206f1ae9ed9f57
 2.0000000000000000         0.7781512504 C cbb45c672024e143d6bb4c34f4d38370
 2.3333333333333333         1.0791812460 C 574e71516af6a72a08a8b3aa510d6ec2
 2.5000000000000000         1.3802112417 A fb43f2edf9fca48c64726094cbf1ddb3
 2.5277777777777778         1.5563025008 A 1f95c74c21f7de6014ea6f6a0f2e210d
 2.5833333333333333         1.6812412374 A 6af749e09be024edeba840487834c5e7
 2.8000000000000000         1.7781512504 C 02312f9edd74024c66e609f0a34f2229
3.0000000000000000         2.0791812460 C c10d20655944bcb3d78d99be49846785

I want the result to look like this:

0.0000000000 e339e33ef20d66617d7d15418bd6526d

Alan Kilborn

@zarate-petery

I think you have FOUR columns and you want the SECOND one and the hash. Bad problem description aside, I would be tempted to do a replacement with the search mode set to Regular expression.

I would search for ^\s*\S+\s+(\S+)\s+\S+\s+(\S+)
and replace that with \1\x20\2

Disclaimer: I didn’t actually try it. :-)
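
For anyone who would rather script this than use the Replace dialog, here is a minimal Python sketch of the same search and replace. It is only an illustration: the function name extract_second_and_hash and the two-line sample are made up for the example, and a literal space stands in for \x20 because Python's replacement strings do not accept that escape.

```python
import re

# Same pattern as above: skip column 1, capture column 2,
# skip the single-letter column, capture the hash.
# re.MULTILINE makes ^ match at the start of every line.
PATTERN = re.compile(r"^\s*\S+\s+(\S+)\s+\S+\s+(\S+)", re.MULTILINE)


def extract_second_and_hash(text: str) -> str:
    """Hypothetical helper: keep only the second column and the hash."""
    # r"\1 \2" plays the role of the \1\x20\2 replacement.
    return PATTERN.sub(r"\1 \2", text)


sample = (
    " 1.0000000000000000         0.0000000000 A e339e33ef20d66617d7d15418bd6526d\n"
    "3.0000000000000000         2.0791812460 C c10d20655944bcb3d78d99be49846785\n"
)
print(extract_second_and_hash(sample), end="")
```

Because the pattern only relies on runs of whitespace and non-whitespace, it does not care whether a line starts with a space or not.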

Terry R

@zarate-petery said in How to extract certain columns from a big Notepad text file?:

but I need the third column and the hash

Like @Alan-Kilborn, I too had a slight problem with your description. However, since there is a wide area of “no” data (lots of spaces), I concluded that this empty area is actually the 2nd column. If that is true, it causes a further problem: if data can appear there, it is currently hard to “guess” what that data might be (letters, numbers or whatever). So I have assumed that this area does NOT contain any digits.
Other than that, your data looks fairly well ordered, except for the last line, which sits 1 column to the left of all the other lines. As a result, my first regex has to be a bit more selective in how it grabs the data.

So, using the Replace function (search mode set to Regular expression), we have:
Find What:(?-s)^\h?.{18,20}[^0-9]+(\d+\.\d+\h+).\h+(.+\R?)
Replace With:\1\2
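
As a rough cross-check outside Notepad++, the same idea can be sketched in Python. This is an approximation, not an exact port: Python's re module has no \h or \R, so they are spelled out here as [ \t] and (?:\r\n|\n|\r), and (?-s) is dropped because the dot already stops at newlines by default. The function name is hypothetical.

```python
import re

# Approximate translation of the Find What expression above.
#   \h  -> [ \t]             (space or tab only)
#   \R  -> (?:\r\n|\n|\r)    (common line endings only)
PATTERN = re.compile(
    r"^[ \t]?.{18,20}[^0-9]+(\d+\.\d+[ \t]+).[ \t]+(.+(?:\r\n|\n|\r)?)",
    re.MULTILINE,
)


def extract_with_gap_assumption(text: str) -> str:
    """Hypothetical helper: relies on the gap containing no digits."""
    # Group 1 is the number plus its trailing blank, group 2 is the hash
    # plus the line ending, so \1\2 rebuilds "number hash" lines.
    return PATTERN.sub(r"\1\2", text)


sample = (
    " 1.0000000000000000         0.0000000000 A e339e33ef20d66617d7d15418bd6526d\n"
    "3.0000000000000000         2.0791812460 C c10d20655944bcb3d78d99be49846785\n"
)
print(extract_with_gap_assumption(sample), end="")
```

The optional leading blank and the 18 to 20 character allowance are what let the shifted last line through.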

Now, if ALL lines were laid out so that each field started at exactly the same column number as on the previous line and had exactly the same length, we could simplify the regex somewhat. For the Replace function we would have:
Find What:(?-s)^.{28}(.{12}).{2}(.{33}\R?)
Replace with:\1\2
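
When the columns really are fixed-width, the same extraction does not even need a regex. A minimal Python sketch, assuming exactly the widths used in the expression above (28 characters skipped, 12 kept, 2 skipped, 33 kept); the function name and the sample line are made up for illustration:

```python
def extract_fixed_width(line: str) -> str:
    """Hypothetical helper: slice by the same widths as the fixed-width regex."""
    line = line.rstrip("\r\n")
    # .{28} skipped, (.{12}) kept, .{2} skipped, (.{33}) kept
    return line[28:40] + line[42:75]


HASH = "e339e33ef20d66617d7d15418bd6526d"
# Built piece by piece so the widths are explicit:
# 1 blank + 18-char number + 9 blanks = 28 characters before the wanted field.
aligned = " " + "1.0000000000000000" + " " * 9 + "0.0000000000" + " A " + HASH
print(extract_fixed_width(aligned))
```

A line that is shifted by one column, like the last line of the sample data, would be sliced one character off, which is why this approach only suits perfectly aligned files.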

So if you check your data and confirm that the fields really are in “orderly” columns, then the 2nd regex will work. If my and/or @Alan-Kilborn's assumptions aren’t on point, please do give us more information. We are happy to help, but to do so we need ALL the facts. In this case that means examples containing all 5 columns (so we know what characters are in each), whether the columns are exactly the same length on every line, and whether they ALWAYS start at the same column number.

Terry

Alan Kilborn

A nod to @Terry-R for using \h rather than my \s to match horizontal whitespace, although \s should actually work here.