Wanting to extract specific info from several text documents

matto.wh

Greetings,
I have several text documents (saved data from a game) that I would like to extract specific information from. From these files I want to extract all of the 3-letter tags that follow the phrase "owner = ", as well as the date that corresponds to each tag. For instance, in this example:

#140 - Bosnia

owner = ROM
controller = ROM
culture = illyrian
religion = druidism
capital = "Servitium"
trade_goods = salt
hre = no
discovered_by = ottoman
discovered_by = middle_eastern
discovered_by = muslim
discovered_by = roman_group
discovered_by = eastern
discovered_by = barbarian
discovered_by = western
base_tax = 2
base_production = 2
base_manpower = 1
is_city = yes
add_core = ROM

395.1.17 = { controller = ROW owner = ROW add_core = ROW remove_core = ROM } # Final division of the empire
450.1.1 = { religion = arianism }
455.1.1 = { owner = OST controller = OST add_core = OST remove_core = ROW culture = gothic }
540.1.1 = { owner = BYZ controller = BYZ add_core = BYZ remove_core = OST }
610.1.1 = { add_core = BOS culture = bosnian religion = slavic }
630.1.1 = { owner = BOS controller = BOS remove_core = BYZ capital = "Doboj" }
869.1.1 = { religion = orthodox }
880.1.1 = { owner = SER controller = SER add_core = SER }
960.1.1 = { owner = CRO controller = CRO add_core = CRO remove_core = SER }
997.1.1 = { owner = BOS controller = BOS add_core = BOS }
1000.1.1 = { base_tax = 3 base_production = 3 }
1050.1.1 = { owner = CRO controller = CRO add_core = CRO }
1077.1.1 = { owner = DOC controller = DOC add_core = DOC remove_core = CRO }
1095.1.1 = { owner = SER controller = SER add_core = SER remove_core = DOC }
1136.1.1 = { owner = HUN controller = HUN add_core = HUN remove_core = SER }
1165.1.1 = { owner = BYZ controller = BYZ add_core = BYZ remove_core = HUN }
1180.1.1 = { owner = BOS controller = BOS add_core = BOS remove_core = BYZ }
1305.1.1 = { owner = CRO controller = CRO add_core = CRO }
1322.1.1 = { owner = BOS controller = BOS remove_core = CRO }

1463.1.1 = {
	owner = TUR
	controller = TUR
	add_core = TUR
} # The Ottoman province of Bosnia
1583.1.1 = { fort_16th = yes }
1593.1.1 = { unrest = 3 } # Fighting began in northwestern Bosnia, sparked Habsburg-Ottoman conflict
1606.1.1 = { unrest = 0 } # Temporarty peace
1683.1.1 = { unrest = 6 } # Heavy fighting & destruction in western Bosnia
1699.1.1 = { unrest = 0 } # Flood of Muslim refugees from Slavonia & Ottoman Hungary 
1716.12.9 = { controller = HAB } # Occupied by Habsburg
1718.7.21 = { controller = TUR }
1737.7.1 = { controller = HAB } # Occupied by Habsburg again
1738.1.1 = { unrest = 5 } # The constant fighting, increased taxation caused tax revolts
1739.9.18 = { controller = TUR } # Treaty of Belgrade, Habsburg gave up its claim to the territory
1740.1.1 = { unrest = 8 }
1750.1.1 = { unrest = 0 }
1788.12.6 = { controller = HAB } # Habsburg invasion
1791.8.4 = { controller = TUR } # Treaty of Sistova

1875.11.2  = { revolt = { type = nationalist_rebels size = 1 } controller = REB add_core = SER }
1877.8.4   = { revolt = {} controller = TUR }
1878.7.13  = { owner = HAB controller = HAB add_core = HAB }
1908.10.5  = { remove_core = TUR }
1910.1.1 = { discovered_by = asian }
1918.12.1  = { owner = YUG controller = YUG add_core = YUG add_core = BHE remove_core = HAB remove_core = BOS }
1941.4.6   = { owner = GER controller = GER }
1941.4.10  = { owner = CRO controller = CRO add_core = CRO }
1945.5.8   = { owner = YUG controller = YUG remove_core = CRO }
1992.3.3   = { owner = BHE controller = BHE remove_core = YUG remove_core = SER }

I would then want to extract the following:

395.1.17 ROW
455.1.1 OST
540.1.1 BYZ
630.1.1 BOS
880.1.1 SER
960.1.1 CRO
997.1.1 BOS
1050.1.1 CRO
1077.1.1 DOC
1095.1.1 SER
1136.1.1 HUN
1165.1.1 BYZ
1180.1.1 BOS
1305.1.1 CRO
1322.1.1 BOS
1463.1.1 TUR
1878.7.13 HAB
1918.12.1 YUG
1941.4.6 GER
1941.4.10 CRO
1945.5.8 YUG
1992.3.3 BHE

All I am wanting is just the “owner” tags and the dates associated with them. As one can see, the format of the document tends to be along the lines of Y.M.D = {}
One might notice that there is also an unbound "owner = " at the top of the document. Not every one of the text documents has one of these within it, and so it can either be ignored or added to the output listing without a date, or if possible, with “2.1.1” as a placeholder date.
I hope this is clear enough as to what I am wanting to extract from these files; please let me know how I can give more clarification on what I am trying to get here if it is not clear.

Thanks!

PeterJones

@Matthew-Wheaton said in Wanting to extract specific info from several text documents:

All I am wanting

Is that all? ;-)

1. Copy all text to a new file (or copy all the files to a new directory) and edit the new copies, because we’re going to be deleting data, and we don’t want it lost from the original (especially if something goes wrong).
2. Format all entries (single-line or multiline) that have both the date and the owner:
   FIND = ^(\d+\.\d+\.\d+)\h*=\h*{[^}]*owner\h*=\h*(\w+)[^}]*}(?-s:.*)$
   REPLACE = ☺$1 $2
   SEARCH MODE = regular expression
   REPLACE ALL
3. Delete any line that doesn’t start with a smiley:
   FIND = ^(?!☺).*(\R|\Z)
   REPLACE = empty field
   SEARCH MODE = regular expression
   REPLACE ALL
4. Delete the smileys:
   FIND = ^☺
   REPLACE = empty field
   SEARCH MODE = regular expression
   REPLACE ALL
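
For anyone who would rather test the idea outside Notepad++, here is a minimal Python sketch of roughly the same match. It is an approximation, not a translation of the steps above: Python's `re` module has no `\h`, so `[ \t]*` stands in for it, the `\b` in front of `owner` is an addition, and it takes the first `owner` in a block rather than the last. The function name `extract_owner_changes` is made up for this example.

```python
import re

# Approximation of the FIND expression: a date key at the start of a line,
# an opening brace, then "owner = TAG" before the first closing brace.
# [ \t]* replaces \h* (assumption: only spaces and tabs are used), and \b
# keeps words that merely end in "owner" from matching.
DATED_OWNER = re.compile(
    r"^(\d+\.\d+\.\d+)[ \t]*=[ \t]*\{[^}]*?\bowner[ \t]*=[ \t]*(\w+)",
    re.MULTILINE,
)

def extract_owner_changes(text):
    """Return (date, tag) pairs for every dated block that sets an owner."""
    return DATED_OWNER.findall(text)

sample = """owner = ROM
395.1.17 = { controller = ROW owner = ROW add_core = ROW remove_core = ROM } # Final division of the empire
450.1.1 = { religion = arianism }
1463.1.1 = {
	owner = TUR
	controller = TUR
}
1875.11.2  = { revolt = { type = nationalist_rebels size = 1 } controller = REB add_core = SER }
1878.7.13  = { owner = HAB controller = HAB add_core = HAB }
"""

for date, tag in extract_owner_changes(sample):
    print(date, tag)
# prints:
# 395.1.17 ROW
# 1463.1.1 TUR
# 1878.7.13 HAB
```

Because `[^}]` also matches line breaks, the multi-line 1463.1.1 block is picked up without any special handling, and blocks with a nested `{ }` before the owner are skipped, just as in the regex above.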

This ignored the owner that wasn’t inside {}s
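
If the undated owner is wanted after all, one possible approach is a separate pass, sketched here in Python. The placeholder date "2.1.1" comes from the original question; the function name `initial_owner` and the assumption that the undated `owner =` always sits unindented at the start of its own line are mine.

```python
import re

PLACEHOLDER_DATE = "2.1.1"  # placeholder suggested in the question, not a real date

# Assumption: the undated owner starts its own line with no indentation,
# while owners inside multi-line { } blocks are indented.
TOP_LEVEL_OWNER = re.compile(r"^owner[ \t]*=[ \t]*(\w+)", re.MULTILINE)

def initial_owner(text):
    """Return [(PLACEHOLDER_DATE, tag)] for an undated owner line, or [] if there is none."""
    match = TOP_LEVEL_OWNER.search(text)
    return [(PLACEHOLDER_DATE, match.group(1))] if match else []

print(initial_owner("owner = ROM\ncontroller = ROM\n1463.1.1 = {\n\towner = TUR\n}\n"))
# prints: [('2.1.1', 'ROM')]
```

Files without an undated owner simply yield an empty list, so the result can be put in front of the dated pairs unconditionally.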

There are probably formatting exceptions that my regex won’t handle. This is a good starting point; anything beyond this, I’m going to leave as an exercise to the reader

----


Please note: This Community Forum is not a data transformation service; you should not expect to be able to always say “I have data like X and want it to look like Y” and have us do all the work for you. If you are new to the Forum, and new to regular expressions, we will often give help on the first one or two data-transformation questions, especially if they are well-asked and you show a willingness to learn; and we will point you to the documentation where you can learn how to do the data transformations for yourself in the future. But if you repeatedly ask us to do your work for you, you will find that the patience of usually-helpful Community members wears thin. The best way to learn regular expressions is by experimenting with them yourself, and getting a feel for how they work; having us spoon-feed you the answers without you putting in the effort doesn’t help you in the long term and is uninteresting and annoying for us.

Terry R

@matto-wh said in Wanting to extract specific info from several text documents:

please let me know how I can give more clarification on what I am trying to get here if it is not clear.

A question: almost all the lines are of the type “date” followed by information within the { and } on one line, but one seems to be on multiple lines. This example hasn’t been altered at all has it? It is important for anyone who wishes to help you to be able to trust what you show.

Also, how many files are you needing to accomplish the extraction on?

Having said that, it would seem that a method of getting the data might be to (this would use copies of the files since we are destroying data within them):

1. Remove all line feeds (if sets can appear on multiple lines)
2. Arrange the data so each line contains 1 “date” and “owner =” set
3. Remove all other text

Note this is just a high level description. That’s how I operate. Think the solution in my home language, then the coding (in whatever computer language is required) can be more easily accomplished.
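
As a rough illustration only, the three steps listed above could look like this in Python. The function name `flatten_and_extract` is hypothetical, and the sketch assumes that nothing other than a real entry looks like `Y.M.D =` (for example, that comments never contain such text).

```python
import re

def flatten_and_extract(text):
    """Follow the three steps: flatten, one date set per line, drop everything else."""
    # Step 1: remove line feeds by collapsing every run of whitespace to one space.
    flat = " ".join(text.split())
    # Step 2: start a new line in front of every "Y.M.D =" date key.
    one_per_line = re.sub(r"\s*(\d+\.\d+\.\d+ =)", r"\n\1", flat)
    # Step 3: keep only "date TAG" for lines whose block sets an owner.
    results = []
    for line in one_per_line.splitlines():
        match = re.match(r"(\d+\.\d+\.\d+) = \{[^}]*?\bowner = (\w+)", line)
        if match:
            results.append(f"{match.group(1)} {match.group(2)}")
    return results

sample = (
    "owner = ROM\n"
    "455.1.1 = { owner = OST controller = OST }\n"
    "1463.1.1 = {\n\towner = TUR\n\tcontroller = TUR\n} # The Ottoman province of Bosnia\n"
    "1583.1.1 = { fort_16th = yes }\n"
)
print(flatten_and_extract(sample))
# prints: ['455.1.1 OST', '1463.1.1 TUR']
```

Collapsing the whitespace first is what makes the multi-line 1463.1.1 entry behave like the single-line ones.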

Terry

PS just about to post when I see @PeterJones post, very similar to my concept.

matto.wh

@Terry-R

This example hasn’t been altered at all has it?

No, I have not altered the example at all. Part of the issue I have is that in some of the documents there are instances like this one that span multiple lines, or where the "owner = " is not the first item listed.

Also, how many files are you needing to accomplish the extraction on?

It’s a couple thousand files. Hence my wanting to automate it in some capacity. The “date” and “owner” sets are the data I am trying to extract from these files. Destroying all the other text in the (copies of the) files aside from the dates and owner pairings would likely be the most effective means of doing this I think. I am not very familiar with programming/coding or even Notepad++, so I would like to know how exactly I would go about making it do such a thing?

Thanks

Terry R

@matto-wh said in Wanting to extract specific info from several text documents:

so I would like to know how exactly I would go about making it do such a thing

Well, @PeterJones's solution (which is coded as a regular expression (regex) and ready to try) would be a good starting point.

Often when we provide such a regex we can easily code for the majority of situations, but we sometimes aren’t aware of edge cases. The “set” in your example that spans multiple lines is one such case, although since it is shown, it can be catered for.

Do you have other means to verify the completeness of the data extracted? It’s easy to provide regex but we have to leave it to the requester to verify that it meets their needs.

You mentioned over 1000 files to be consumed. I wouldn’t suggest running a provided solution on that many until you have tested it on one file that you have also solved independently by other means, and confirmed that the regex gives the same result.
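
One way to make that single-file check repeatable is sketched below in Python. It compares the pairs found by a regex against a list typed in by hand; the pattern is a Python approximation of the kind of regex discussed in this thread (not the exact Notepad++ expression), and `compare_with_manual` is a hypothetical helper name.

```python
import re

# Python approximation of a "date = { ... owner = TAG" match (an assumption,
# not the exact Notepad++ regex).
DATED_OWNER = re.compile(
    r"^(\d+\.\d+\.\d+)[ \t]*=[ \t]*\{[^}]*?\bowner[ \t]*=[ \t]*(\w+)",
    re.MULTILINE,
)

def compare_with_manual(text, manual_pairs):
    """Return (missed, unexpected) relative to a hand-checked list of (date, tag) pairs."""
    found = DATED_OWNER.findall(text)
    missed = [pair for pair in manual_pairs if pair not in found]
    unexpected = [pair for pair in found if pair not in manual_pairs]
    return missed, unexpected

sample = "owner = ROM\n455.1.1 = { owner = OST }\n540.1.1 = { owner = BYZ }\n"
manual = [("455.1.1", "OST"), ("540.1.1", "BYZ"), ("630.1.1", "BOS")]
print(compare_with_manual(sample, manual))
# prints: ([('630.1.1', 'BOS')], [])
```

Both lists being empty means the regex and the manual extraction agree for that file; anything in either list is an edge case worth posting back here.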

So give the provided solution a go. Very possibly (as @PeterJones mentioned) you may need to return to post again in this thread and show other edge cases you weren’t aware of. From that someone could further refine the solution until you were sure of the result.

Terry

Mark Olson

With a little preprocessing, my JsonTools plugin can handle this.

I think that PeterJones’ solution is fine, and with a little tweaking it can handle weirder formatting, so don’t try this unless you get desperate. If your file is so pathological that JsonTools can’t help you, IDK, maybe try PythonScript.

1. Add a { (open curly brace) as the first character of the file
2. Find/replace [\w\.]+ with "$0" (regular expressions on)
3. Find/replace = with :
4. Use JsonTools to pretty-print the file, since it is now reasonably close to syntactically valid JSON
5. Open up the JSON tree view.
6. Run the following query: items(@.g`^[\\d\\.]+$`.owner)[:]{date:@[0], country:@[1]}
7. Use the JSON-to-CSV form to dump the result of the query into a CSV file.
The result isn’t exactly what you wanted, but it’s close enough, and it’s in CSV format, which is intrinsically useful.

country,date
CRO,1050.1.1
DOC,1077.1.1
SER,1095.1.1
HUN,1136.1.1
BYZ,1165.1.1
BOS,1180.1.1
CRO,1305.1.1
BOS,1322.1.1
TUR,1463.1.1
HAB,1878.7.13
YUG,1918.12.1
CRO,1941.4.10
GER,1941.4.6
YUG,1945.5.8
BHE,1992.3.3
ROW,395.1.17
OST,455.1.1
BYZ,540.1.1
BOS,630.1.1
SER,880.1.1
CRO,960.1.1
BOS,997.1.1
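
One visible difference from the requested output is the ordering: the rows above are sorted as text, so 1050.1.1 comes before 395.1.17 and 1941.4.10 before 1941.4.6. A small Python sketch, assuming the CSV has exactly the `country,date` header shown above, could put it into the requested "date TAG" layout in chronological order. The function name `csv_to_requested_layout` is made up for this example.

```python
import csv
import io

def csv_to_requested_layout(csv_text):
    """Turn 'country,date' rows into 'date TAG' lines in chronological order."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Compare dates as (year, month, day) integers rather than as text.
    rows.sort(key=lambda row: tuple(int(part) for part in row["date"].split(".")))
    return [f"{row['date']} {row['country']}" for row in rows]

sample_csv = "country,date\nCRO,1050.1.1\nCRO,1941.4.10\nGER,1941.4.6\nROW,395.1.17\n"
print(csv_to_requested_layout(sample_csv))
# prints: ['395.1.17 ROW', '1050.1.1 CRO', '1941.4.6 GER', '1941.4.10 CRO']
```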

PeterJones

@Mark-Olson said in Wanting to extract specific info from several text documents:

With a little preprocessing, my JsonTools plugin can handle this.

Wouldn’t you have to do it individually for each of the “couple thousand files”? Or can the JsonTools plugin run the same search/replace across “all open files” or “all files in folder” akin to “Find in Files”?

Because my native regex search/replace version can work in “Find in Files”, so it would do each of the steps across the thousands of files without any user interaction (and the files don’t even have to be open in Notepad++).
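
For comparison, the same "whole folder, no interaction" idea can be sketched as a standalone Python script. This is not how “Find in Files” works internally; it is a separate approach that only reads the originals and writes the extracted pairs to a second folder. The `*.txt` pattern, the UTF-8 encoding, and the name `extract_folder` are assumptions, and the regex is a Python approximation rather than the Notepad++ expression.

```python
import re
from pathlib import Path

# Python approximation of a "date = { ... owner = TAG" match (assumption).
DATED_OWNER = re.compile(
    r"^(\d+\.\d+\.\d+)[ \t]*=[ \t]*\{[^}]*?\bowner[ \t]*=[ \t]*(\w+)",
    re.MULTILINE,
)

def extract_folder(source_dir, output_dir, pattern="*.txt"):
    """Write one 'date TAG' listing per input file; the originals are only read."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(Path(source_dir).glob(pattern)):
        # errors="replace" keeps going if a file is not valid UTF-8 (assumed encoding).
        text = path.read_text(encoding="utf-8", errors="replace")
        lines = [f"{date} {tag}" for date, tag in DATED_OWNER.findall(text)]
        target = out / path.name
        target.write_text("\n".join(lines) + "\n", encoding="utf-8")
        written.append(target)
    return written
```

Calling `extract_folder("history", "owners")` with two hypothetical folder names would leave `history` untouched and create one small listing per file in `owners`, which also takes care of the "work on copies" advice.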

Mark Olson

@PeterJones
Oh, I missed that part. Yeah, that makes my cute solution impractical.

The JSON-from-files-and-APIs form (which I just call the grepper form) can run the query on thousands of files, but:

- it wouldn’t actually replace the contents of the file
- it would only output JSON anyway
- you’d have to do the initial preprocessing beforehand, at which point you’re already using some kind of grep tool, so why bother using JsonTools

Nobody has ever asked me to implement a feature that could do that, and I’ve never seen the need.

That said, if for whatever reason people want to see how to solve this using a combination of JsonTools and PythonScript, I can show you.

matto.wh

@PeterJones
Fantastic! That is exactly what I have been wanting! Thank you very much!