Wanting to extract specific info from several text documents

Reply to Wanting to extract specific info from several text documents on Fri, 21 Jul 2023 13:11:48 GMT

matto.wh — Fri, 21 Jul 2023 13:11:48 GMT

@PeterJones
Fantastic! That is exactly what I have been wanting! Thank you very much!

Reply to Wanting to extract specific info from several text documents on Thu, 20 Jul 2023 22:06:46 GMT

Mark Olson — Thu, 20 Jul 2023 22:06:46 GMT

@PeterJones
Oh, I missed that part. Yeah, that makes my cute solution impractical.

The JSON-from-files-and-APIs form (which I just call the grepper form) can run the query on thousands of files, but:

it wouldn’t actually replace the contents of the file
it would only output JSON anyway
you’d have to do the initial preprocessing beforehand, at which point you’re already using some kind of grep tool, so why bother using JsonTools?

Nobody has ever asked me to implement a feature that could do that, and I’ve never seen the need.

That said, if for whatever reason people want to see how to solve this using a combination of JsonTools and PythonScript, I can show you.

Reply to Wanting to extract specific info from several text documents on Thu, 20 Jul 2023 21:39:12 GMT

PeterJones — Thu, 20 Jul 2023 21:39:12 GMT

@Mark-Olson said in Wanting to extract specific info from several text documents:

With a little preprocessing, my JsonTools plugin can handle this.

Wouldn’t you have to do it individually for each of the “couple thousand files”? Or can the JsonTools plugin run the same search/replace across “all open files” or “all files in folder” akin to “Find in Files”?

Because my native regex search/replace version can work in “Find in Files”, so it would do each of steps across the thousands of files without any user interaction (and the files don’t even have to be open in Notepad++).
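For readers who would rather script the batch run than use “Find in Files”, here is a minimal Python sketch of the same idea. It is not anything Notepad++ or JsonTools provides: the function name `extract_folder`, the `*.txt` glob, and the demo file are all hypothetical, and the pattern is adapted from the FIND expression posted in this thread, with `[ \t]` standing in for `\h` because Python's `re` module has no `\h`. It writes results to a separate folder, so the originals are never modified.

```python
import re
import tempfile
from pathlib import Path

# Adapted from the thread's FIND regex; [ \t] replaces \h (not supported by Python's re).
PAIR = re.compile(
    r"^(\d+\.\d+\.\d+)[ \t]*=[ \t]*\{[^}]*owner[ \t]*=[ \t]*(\w+)[^}]*\}",
    re.MULTILINE,
)

def extract_folder(src_dir, dst_dir, pattern="*.txt"):
    """Write 'date owner' lines for every matching file in src_dir into dst_dir.

    Returns the number of files processed. The glob pattern is an assumption;
    adjust it to whatever extension the real files use.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    count = 0
    for path in sorted(src.glob(pattern)):
        text = path.read_text(encoding="utf-8", errors="replace")
        lines = [f"{date} {owner}" for date, owner in PAIR.findall(text)]
        out = "\n".join(lines) + ("\n" if lines else "")
        (dst / path.name).write_text(out, encoding="utf-8")
        count += 1
    return count

# Tiny demo on a made-up file in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp, "in")
    src.mkdir()
    (src / "a.txt").write_text("1050.1.1 = { owner = CRO }\njunk\n", encoding="utf-8")
    n = extract_folder(src, Path(tmp, "out"))
    result = Path(tmp, "out", "a.txt").read_text(encoding="utf-8")
    print(n, repr(result))
```

Because the output goes to a different folder, a bad pattern costs nothing but a rerun.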

Reply to Wanting to extract specific info from several text documents on Thu, 20 Jul 2023 21:35:11 GMT

Mark Olson — Thu, 20 Jul 2023 21:35:11 GMT

With a little preprocessing, my JsonTools plugin can handle this.

I think that PeterJones’ solution is fine, and with a little tweaking it can handle weirder formatting, so don’t try this unless you get desperate. If your file is so pathological that JsonTools can’t help you, IDK, maybe try PythonScript.

1. Add a { (open curly brace) as the first character of the file
2. Find/replace [\w\.]+ with "$0" (regular expressions on)
3. Find/replace = with :
4. Use JsonTools to pretty-print the file, since it is now reasonably close to syntactically valid JSON
5. Open up the JSON tree view.
6. Run the following query: items(@.g`^[\\d\\.]+$`.owner)[:]{date:@[0], country:@[1]}
7. Use the JSON-to-CSV form to dump the result of the query into a CSV file.
The result isn’t exactly what you wanted, but it’s close enough, and it’s in CSV format, which is intrinsically useful.

country,date
CRO,1050.1.1
DOC,1077.1.1
SER,1095.1.1
HUN,1136.1.1
BYZ,1165.1.1
BOS,1180.1.1
CRO,1305.1.1
BOS,1322.1.1
TUR,1463.1.1
HAB,1878.7.13
YUG,1918.12.1
CRO,1941.4.10
GER,1941.4.6
YUG,1945.5.8
BHE,1992.3.3
ROW,395.1.17
OST,455.1.1
BYZ,540.1.1
BOS,630.1.1
SER,880.1.1
CRO,960.1.1
BOS,997.1.1
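To make the three find/replace preprocessing steps in this post concrete, here is a small Python sketch of what they do to a line of text. The function name `jsonish` and the sample line are made up for illustration, and this only mimics the text edits; it does not involve JsonTools at all. Note that the result is still not strict JSON (no closing brace, no commas between entries), so Python's own `json` module would reject it; the post relies on JsonTools coping with the near-JSON.

```python
import re

def jsonish(text):
    """Apply the three text edits from the post, in order."""
    text = "{" + text                            # 1. add an opening curly brace
    text = re.sub(r"[\w.]+", r'"\g<0>"', text)   # 2. wrap each word/number run in quotes
    return text.replace("=", ":")                # 3. turn = into :

print(jsonish("1050.1.1 = { owner = CRO }"))
# prints: {"1050.1.1" : { "owner" : "CRO" }
```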

Reply to Wanting to extract specific info from several text documents on Thu, 20 Jul 2023 21:26:23 GMT

Terry R — Thu, 20 Jul 2023 21:26:23 GMT

@matto-wh said in Wanting to extract specific info from several text documents:

so I would like to know how exactly I would go about making it do such a thing

Well, @PeterJones’ solution (which is coded as a regular expression (regex) and is ready to try) would be a good starting point.

Often when we provide such a regex we can easily code for the majority of situations, but we sometimes aren’t aware of edge cases. The one in your example where the “set” is on multiple lines is one such case, although since it is shown it can be catered for.

Do you have other means to verify the completeness of the data extracted? It’s easy to provide regex but we have to leave it to the requester to verify that it meets their needs.

You mentioned over 1000 files to be consumed. I wouldn’t suggest running a provided solution on that many until you have tested it on 1 file that you have independently solved by other means, and confirmed that the regex gives the same result.
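One cheap completeness check, sketched in Python under some assumptions: count how many “owner =” occurrences a file contains and compare that with how many date/owner pairs the regex actually extracts. The name `coverage` and the sample text are hypothetical, and the pattern is adapted from the regex in this thread (`[ \t]` in place of `\h`, which Python's `re` lacks). A mismatch does not prove an error, since an owner outside any { } block is skipped on purpose, but it does flag a file worth inspecting by hand.

```python
import re

# Adapted from the thread's FIND regex; [ \t] replaces \h.
PAIR = re.compile(
    r"^(\d+\.\d+\.\d+)[ \t]*=[ \t]*\{[^}]*owner[ \t]*=[ \t]*(\w+)[^}]*\}",
    re.MULTILINE,
)
# Every place the word "owner" is assigned, whether or not the regex picks it up.
OWNER = re.compile(r"\bowner[ \t]*=")

def coverage(text):
    """Return (pairs extracted, 'owner =' occurrences seen) for one file's text."""
    return len(PAIR.findall(text)), len(OWNER.findall(text))

# Made-up sample: two dated blocks plus one bare owner line.
SAMPLE = """\
1050.1.1 = { owner = CRO }
owner = XXX
1077.1.1 = {
    owner = DOC
}
"""

extracted, seen = coverage(SAMPLE)
if extracted != seen:
    print(f"{seen - extracted} 'owner =' occurrence(s) not extracted; inspect by hand")
```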

So give the provided solution a go. Very possibly (as @PeterJones mentioned) you may need to return to post again in this thread and show other edge cases you weren’t aware of. From that someone could further refine the solution until you were sure of the result.

Terry

Reply to Wanting to extract specific info from several text documents on Thu, 20 Jul 2023 21:06:38 GMT

matto.wh — Thu, 20 Jul 2023 21:06:38 GMT

@Terry-R

This example hasn’t been altered at all has it?

No, I have not altered the example at all. Part of the issue I have is that in some of the documents there are instances like this with multiple lines, or where the "owner = " is not the first item listed.

Also, how many files are you needing to accomplish the extraction on?

It’s a couple thousand files. Hence my wanting to automate it in some capacity. The “date” and “owner” sets are the data I am trying to extract from these files. Destroying all the other text in the (copies of the) files aside from the dates and owner pairings would likely be the most effective means of doing this I think. I am not very familiar with programming/coding or even Notepad++, so I would like to know how exactly I would go about making it do such a thing?

Thanks

Reply to Wanting to extract specific info from several text documents on Thu, 20 Jul 2023 20:32:30 GMT

Terry R — Thu, 20 Jul 2023 20:32:30 GMT

@matto-wh said in Wanting to extract specific info from several text documents:

please let me know how I can give more clarification on what I am trying to get here if it is not clear.

A question: almost all the lines are of the type “date” followed by information within the { and } on 1 line, but one seems to be on multiple lines. This example hasn’t been altered at all has it? It is important for anyone who wishes to help you to be able to trust what you show.

Also, how many files are you needing to accomplish the extraction on?

Having said that it would seem that a method of getting the data might be to (this would use copies of the files since we are destroying data within them):

remove all line feeds (if sets can appear on multiple lines)
arrange the data so each line contains 1 “date” and “owner =” set
remove all other text
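As a rough illustration of those three steps, here is a Python sketch that follows them literally. The function name `outline_extract` and the sample text are invented for the example, and the date format (three dot-separated numbers followed by =) is inferred from the data shown in this thread.

```python
import re

def outline_extract(text):
    """Follow the three outlined steps and return 'date owner' strings."""
    # Step 1: remove line feeds so a multi-line set collapses onto one line.
    flat = re.sub(r"\s*[\r\n]+\s*", " ", text)
    # Step 2: break the text again before each date, so every chunk holds one set.
    chunks = re.split(r"\s+(?=\d+\.\d+\.\d+\s*=)", flat)
    # Step 3: drop everything except the date and the owner inside its { }.
    pairs = []
    for chunk in chunks:
        m = re.match(r"(\d+\.\d+\.\d+)\s*=\s*\{[^}]*?\bowner\s*=\s*(\w+)", chunk)
        if m:
            pairs.append(f"{m.group(1)} {m.group(2)}")
    return pairs

# Made-up sample with one single-line set and one multi-line set.
SAMPLE = """\
1050.1.1 = { owner = CRO }
1077.1.1 = {
    controller = DOC
    owner = DOC
}
culture = croatian
"""

print(outline_extract(SAMPLE))
# prints: ['1050.1.1 CRO', '1077.1.1 DOC']
```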

Note this is just a high level description. That’s how I operate: I think the solution through in my native language first, then the coding (in whatever computer language is required) can be more easily accomplished.

Terry

PS: I was just about to post when I saw @PeterJones’ post, which is very similar to my concept.

Reply to Wanting to extract specific info from several text documents on Thu, 20 Jul 2023 20:30:38 GMT

PeterJones — Thu, 20 Jul 2023 20:30:38 GMT

@Matthew-Wheaton said in Wanting to extract specific info from several text documents:

All I am wanting

Is that all? ;-)

1. Copy all text to a new file (or copy all the files to a new directory) and edit the new copies, because we’re going to be deleting data, and we don’t want it lost from the original (especially if something goes wrong)
2. Format all entries (single-line or multiline) that have both the date and the owner:
   FIND = ^(\d+\.\d+\.\d+)\h*=\h*{[^}]*owner\h*=\h*(\w+)[^}]*}(?-s:.*)$
   REPLACE = ☺$1 $2
   SEARCH MODE = regular expression
   REPLACE ALL
3. Delete any line that doesn’t start with a smiley:
   FIND = ^(?!☺).*(\R|\Z)
   REPLACE = empty field
   SEARCH MODE = regular expression
   REPLACE ALL
4. Delete the smileys:
   FIND = ^☺
   REPLACE = empty field
   SEARCH MODE = regular expression
   REPLACE ALL
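For anyone who ends up scripting this instead, the mark/delete/unmark passes above collapse into a single pass in Python, since you can simply collect the matches rather than tagging lines with a smiley. This is only a sketch: `extract_pairs` and the sample text are made up, and `[ \t]` stands in for `\h`, which Python's `re` module does not support (Notepad++ uses a different regex engine).

```python
import re

# Same idea as the FIND expression above: a date at line start, then a { } block
# that contains "owner = NAME" somewhere inside it (possibly on a later line).
PAIR = re.compile(
    r"^(\d+\.\d+\.\d+)[ \t]*=[ \t]*\{[^}]*owner[ \t]*=[ \t]*(\w+)[^}]*\}",
    re.MULTILINE,
)

def extract_pairs(text):
    """Return one 'date owner' string per block that has both a date and an owner."""
    return [f"{date} {owner}" for date, owner in PAIR.findall(text)]

# Made-up sample covering the cases discussed in the thread.
SAMPLE = """\
1050.1.1 = { owner = CRO }
1077.1.1 = {
    controller = DOC
    owner = DOC
}
owner = XXX
1095.1.1 = { religion = orthodox }
"""

print("\n".join(extract_pairs(SAMPLE)))
# prints:
# 1050.1.1 CRO
# 1077.1.1 DOC
```

As with the regex version, the bare `owner = XXX` line and the block with no owner are both skipped.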

This ignored the owner that wasn’t inside {}s

There are probably formatting exceptions that my regex won’t handle. This is a good starting point; anything beyond this, I’m going to leave as an exercise for the reader.

----

Please note: This Community Forum is not a data transformation service; you should not expect to be able to always say “I have data like X and want it to look like Y” and have us do all the work for you. If you are new to the Forum, and new to regular expressions, we will often give help on the first one or two data-transformation questions, especially if they are well-asked and you show a willingness to learn; and we will point you to the documentation where you can learn how to do the data transformations for yourself in the future. But if you repeatedly ask us to do your work for you, you will find that the patience of usually-helpful Community members wears thin. The best way to learn regular expressions is by experimenting with them yourself, and getting a feel for how they work; having us spoon-feed you the answers without you putting in the effort doesn’t help you in the long term and is uninteresting and annoying for us.