Wanting to extract specific info from several text documents
-
Greetings,
I have several text documents (saved data from a game) that I would like to extract specific information from. From these files I want to extract all of the 3-letter tags that follow the phrase "owner = ", as well as the date that corresponds to each tag. For instance, in this example:
#140 - Bosnia
owner = ROM
controller = ROM
culture = illyrian
religion = druidism
capital = "Servitium"
trade_goods = salt
hre = no
discovered_by = ottoman
discovered_by = middle_eastern
discovered_by = muslim
discovered_by = roman_group
discovered_by = eastern
discovered_by = barbarian
discovered_by = western
base_tax = 2
base_production = 2
base_manpower = 1
is_city = yes
add_core = ROM
395.1.17 = { controller = ROW owner = ROW add_core = ROW remove_core = ROM } # Final division of the empire
450.1.1 = { religion = arianism }
455.1.1 = { owner = OST controller = OST add_core = OST remove_core = ROW culture = gothic }
540.1.1 = { owner = BYZ controller = BYZ add_core = BYZ remove_core = OST }
610.1.1 = { add_core = BOS culture = bosnian religion = slavic }
630.1.1 = { owner = BOS controller = BOS remove_core = BYZ capital = "Doboj" }
869.1.1 = { religion = orthodox }
880.1.1 = { owner = SER controller = SER add_core = SER }
960.1.1 = { owner = CRO controller = CRO add_core = CRO remove_core = SER }
997.1.1 = { owner = BOS controller = BOS add_core = BOS }
1000.1.1 = { base_tax = 3 base_production = 3 }
1050.1.1 = { owner = CRO controller = CRO add_core = CRO }
1077.1.1 = { owner = DOC controller = DOC add_core = DOC remove_core = CRO }
1095.1.1 = { owner = SER controller = SER add_core = SER remove_core = DOC }
1136.1.1 = { owner = HUN controller = HUN add_core = HUN remove_core = SER }
1165.1.1 = { owner = BYZ controller = BYZ add_core = BYZ remove_core = HUN }
1180.1.1 = { owner = BOS controller = BOS add_core = BOS remove_core = BYZ }
1305.1.1 = { owner = CRO controller = CRO add_core = CRO }
1322.1.1 = { owner = BOS controller = BOS remove_core = CRO }
1463.1.1 = { owner = TUR
controller = TUR add_core = TUR } # The Ottoman province of Bosnia
1583.1.1 = { fort_16th = yes }
1593.1.1 = { unrest = 3 } # Fighting began in northwestern Bosnia, sparked Habsburg-Ottoman conflict
1606.1.1 = { unrest = 0 } # Temporarty peace
1683.1.1 = { unrest = 6 } # Heavy fighting & destruction in western Bosnia
1699.1.1 = { unrest = 0 } # Flood of Muslim refugees from Slavonia & Ottoman Hungary
1716.12.9 = { controller = HAB } # Occupied by Habsburg
1718.7.21 = { controller = TUR }
1737.7.1 = { controller = HAB } # Occupied by Habsburg again
1738.1.1 = { unrest = 5 } # The constant fighting, increased taxation caused tax revolts
1739.9.18 = { controller = TUR } # Treaty of Belgrade, Habsburg gave up its claim to the territory
1740.1.1 = { unrest = 8 }
1750.1.1 = { unrest = 0 }
1788.12.6 = { controller = HAB } # Habsburg invasion
1791.8.4 = { controller = TUR } # Treaty of Sistova
1875.11.2 = { revolt = { type = nationalist_rebels size = 1 } controller = REB add_core = SER }
1877.8.4 = { revolt = {} controller = TUR }
1878.7.13 = { owner = HAB controller = HAB add_core = HAB }
1908.10.5 = { remove_core = TUR }
1910.1.1 = { discovered_by = asian }
1918.12.1 = { owner = YUG controller = YUG add_core = YUG add_core = BHE remove_core = HAB remove_core = BOS }
1941.4.6 = { owner = GER controller = GER }
1941.4.10 = { owner = CRO controller = CRO add_core = CRO }
1945.5.8 = { owner = YUG controller = YUG remove_core = CRO }
1992.3.3 = { owner = BHE controller = BHE remove_core = YUG remove_core = SER }
I would then want to extract the following:
395.1.17 ROW
455.1.1 OST
540.1.1 BYZ
630.1.1 BOS
880.1.1 SER
960.1.1 CRO
997.1.1 BOS
1050.1.1 CRO
1077.1.1 DOC
1095.1.1 SER
1136.1.1 HUN
1165.1.1 BYZ
1180.1.1 BOS
1305.1.1 CRO
1322.1.1 BOS
1463.1.1 TUR
1878.7.13 HAB
1918.12.1 YUG
1941.4.6 GER
1941.4.10 CRO
1945.5.8 YUG
1992.3.3 BHE
All I am wanting is just the “owner” tags and the dates associated with them. As one can see, the format of the document tends to be along the lines of Y.M.D = {}
One might notice that there is also an unbound "owner = " at the top of the document. Not every one of the text documents has one of these within it, and so it can either be ignored or added to the output listing without a date, or if possible, with “2.1.1” as a placeholder date.
I hope this makes clear what I want to extract from these files. If it does not, please let me know what further clarification I can give. Thanks!
-
@Matthew-Wheaton said in Wanting to extract specific info from several text documents:
All I am wanting
Is that all? ;-)
- copy all text to a new file (or copy all the files to a new directory) and edit the new copies, because we’re going to be deleting data, and we don’t want it lost from the original (especially if something goes wrong)
- format all entries (single-line or multiline) that have both the date and the owner
  FIND = ^(\d+\.\d+\.\d+)\h*=\h*{[^}]*owner\h*=\h*(\w+)[^}]*}(?-s:.*)$
  REPLACE = ☺$1 $2
  SEARCH MODE = regular expression
  REPLACE ALL
- Delete any line that doesn’t start with a smiley
  FIND = ^(?!☺).*(\R|\Z)
  REPLACE = empty field
  SEARCH MODE = regular expression
  REPLACE ALL
- Delete the smileys:
  FIND = ^☺
  REPLACE = empty field
  SEARCH MODE = regular expression
  REPLACE ALL
This ignores the owner that isn’t inside {}s.
There are probably formatting exceptions that my regex won’t handle. This is a good starting point; anything beyond this, I’m going to leave as an exercise to the reader.
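For anyone who would rather script the same idea than run three Replace All passes, here is a minimal Python sketch. It is not the regex above verbatim: Python's `re` has no `\h`, so `[ \t]` stands in for it, leading spaces before a date are tolerated, and it takes the first owner in a block rather than the last. The names `DATED_OWNER`, `UNBOUND_OWNER`, `strip_braced` and `extract_owners` are made up for illustration. It assumes, as in the sample, that no nested {} comes before the owner inside a dated block, and it uses the requested "2.1.1" placeholder for an owner that sits outside all braces.

```python
import re

# A dated block such as `455.1.1 = { owner = OST ... }`.
# [^}]*? may span line breaks, so a block split over several lines still matches.
DATED_OWNER = re.compile(
    r"^[ \t]*(\d+\.\d+\.\d+)[ \t]*=[ \t]*\{[^}]*?\bowner[ \t]*=[ \t]*(\w+)",
    re.MULTILINE,
)
# `owner = XXX` at the start of a line (searched only after braces are stripped).
UNBOUND_OWNER = re.compile(r"^[ \t]*owner[ \t]*=[ \t]*(\w+)", re.MULTILINE)


def strip_braced(text):
    """Remove every {...} block, innermost first, until none are left."""
    previous = None
    while previous != text:
        previous = text
        text = re.sub(r"\{[^{}]*\}", "", text)
    return text


def extract_owners(text, placeholder="2.1.1"):
    """Return (date, tag) pairs; a top-level owner gets the placeholder date."""
    pairs = []
    unbound = UNBOUND_OWNER.search(strip_braced(text))
    if unbound:
        pairs.append((placeholder, unbound.group(1)))
    pairs.extend(DATED_OWNER.findall(text))
    return pairs
```

Calling `extract_owners(text)` on the contents of one file gives a list of pairs that can be printed as `date tag` lines. Pass `placeholder=None` and filter it out if the top-level owner should be ignored instead.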
----
Please note: This Community Forum is not a data transformation service; you should not expect to be able to always say “I have data like X and want it to look like Y” and have us do all the work for you. If you are new to the Forum, and new to regular expressions, we will often give help on the first one or two data-transformation questions, especially if they are well-asked and you show a willingness to learn; and we will point you to the documentation where you can learn how to do the data transformations for yourself in the future. But if you repeatedly ask us to do your work for you, you will find that the patience of usually-helpful Community members wears thin. The best way to learn regular expressions is by experimenting with them yourself, and getting a feel for how they work; having us spoon-feed you the answers without you putting in the effort doesn’t help you in the long term and is uninteresting and annoying for us.
-
@matto-wh said in Wanting to extract specific info from several text documents:
please let me know how I can give more clarification on what I am trying to get here if it is not clear.
A question: almost all the lines consist of a “date” followed by information within { and } on 1 line, but one entry seems to span multiple lines. This example hasn’t been altered at all has it? It is important for anyone who wishes to help you to be able to trust what you show.
Also, how many files do you need to run the extraction on?
Having said that it would seem that a method of getting the data might be to (this would use copies of the files since we are destroying data within them):
- remove all line feeds (if sets can appear on multiple lines)
- arrange the data so each line contains 1 “date” and “owner =” set
- remove all other text
Note this is just a high-level description. That’s how I operate: I think the solution through in plain language first, and then the coding (in whatever computer language is required) can be more easily accomplished.
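As a rough illustration only, the three steps above could be sketched in Python like this. The function name `three_step_extract` and both patterns are hypothetical, not something from the thread, and the sketch assumes that every set starts with a Y.M.D date followed by `= {` and that no nested {} comes before the owner. Splitting on an empty match needs Python 3.7 or later.

```python
import re

# Zero-width split point in front of every "Y.M.D = {".
# The lookbehind stops a split from landing in the middle of a date.
DATE_START = re.compile(r"(?<![\d.])(?=\d+\.\d+\.\d+\s*=\s*\{)")
# "date = { ... owner = TAG" at the start of one set.
DATE_OWNER = re.compile(r"(\d+\.\d+\.\d+)\s*=\s*\{[^}]*?\bowner\s*=\s*(\w+)")


def three_step_extract(text):
    # Step 1: remove all line feeds (and collapse other whitespace runs).
    flat = " ".join(text.split())
    # Step 2: arrange the data so each chunk holds one dated set.
    chunks = DATE_START.split(flat)
    # Step 3: remove all other text, keeping only "date owner".
    result = []
    for chunk in chunks:
        match = DATE_OWNER.match(chunk)
        if match:
            result.append(f"{match.group(1)} {match.group(2)}")
    return result
```

Sets without an owner simply fail the step 3 match and drop out, which is the "remove all other text" part.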
Terry
PS: I was just about to post when I saw @PeterJones’s post, which is very similar to my concept.
-
This example hasn’t been altered at all has it?
No, I have not altered the example at all. Part of the issue I have is that some of the documents contain instances like this, where an entry spans multiple lines or where the "owner = " is not the first item listed.
Also, how many files do you need to run the extraction on?
It’s a couple thousand files. Hence my wanting to automate it in some capacity. The “date” and “owner” sets are the data I am trying to extract from these files. Destroying all the other text in the (copies of the) files aside from the dates and owner pairings would likely be the most effective means of doing this I think. I am not very familiar with programming/coding or even Notepad++, so I would like to know how exactly I would go about making it do such a thing?
Thanks
-
@matto-wh said in Wanting to extract specific info from several text documents:
so I would like to know how exactly I would go about making it do such a thing
Well, @PeterJones’s solution (which is coded as a regular expression (regex) and ready to try) would be a good starting point.
Often when we provide such a regex we can easily code for the majority of situations, but we sometimes aren’t aware of edge cases. The one in your example where the “set” is on multiple lines is one such case, although since you showed it, it can be catered for.
Do you have other means to verify the completeness of the data extracted? It’s easy to provide a regex, but we have to leave it to the requester to verify that it meets their needs.
You mentioned a couple thousand files to be processed. I wouldn’t suggest running a provided solution on that many until you have tested it on 1 file that you have solved independently by other means, and confirmed that the regex gives the same result.
So give the provided solution a go. Very possibly (as @PeterJones mentioned) you may need to return to post again in this thread and show other edge cases you weren’t aware of. From that someone could further refine the solution until you were sure of the result.
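One possible independent cross-check, sketched in Python under stated assumptions: count every `owner =` that sits inside braces in the original file, without relying on the date regex at all, and compare that number with the number of lines extracted. The function names are hypothetical. The sketch assumes `#` always starts a comment that runs to the end of the line, which holds for the sample but is not confirmed for every file.

```python
import re

# Either a brace or an owner assignment, found in document order.
TOKEN = re.compile(r"[{}]|\bowner\s*=\s*\w+")


def count_braced_owners(text):
    """Count owner assignments that appear inside at least one {...}."""
    text = re.sub(r"#[^\n]*", "", text)  # drop comments first
    depth = 0
    count = 0
    for match in TOKEN.finditer(text):
        token = match.group()
        if token == "{":
            depth += 1
        elif token == "}":
            depth = max(depth - 1, 0)
        elif depth > 0:
            count += 1
    return count


def extraction_looks_complete(source_text, extracted_lines):
    """True when the extracted line count matches the braced owner count."""
    return count_braced_owners(source_text) == len(extracted_lines)
```

A mismatch does not say which side is wrong. It only flags a file worth opening by hand, for example one where a single block holds two owners or where an owner hides behind a nested block.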
Terry
-
With a little preprocessing, my JsonTools plugin can handle this.
I think that PeterJones’ solution is fine, and with a little tweaking it can handle weirder formatting, so don’t try this unless you get desperate. If your file is so pathological that JsonTools can’t help you, IDK, maybe try PythonScript.
- Add a { (open curly brace) as the first character of the file
- Find/replace [\w\.]+ with "$0" (regular expressions on)
- Find/replace = with :
- Use JsonTools to pretty-print the file, since it is now reasonably close to syntactically valid JSON
- Open up the JSON tree view.
- Run the following query:
  items(@.g`^[\\d\\.]+$`.owner)[:]{date:@[0], country:@[1]}
- Use the JSON-to-CSV form to dump the result of the query into a CSV file.
- The result isn’t exactly what you wanted, but it’s close enough, and it’s in CSV format, which is intrinsically useful.
  country,date
  CRO,1050.1.1
  DOC,1077.1.1
  SER,1095.1.1
  HUN,1136.1.1
  BYZ,1165.1.1
  BOS,1180.1.1
  CRO,1305.1.1
  BOS,1322.1.1
  TUR,1463.1.1
  HAB,1878.7.13
  YUG,1918.12.1
  CRO,1941.4.10
  GER,1941.4.6
  YUG,1945.5.8
  BHE,1992.3.3
  ROW,395.1.17
  OST,455.1.1
  BYZ,540.1.1
  BOS,630.1.1
  SER,880.1.1
  CRO,960.1.1
  BOS,997.1.1
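The CSV rows come out in text order, so 1050.1.1 lands before 395.1.17 and 1941.4.10 before 1941.4.6. If chronological order matters, a small Python sketch like the following could re-sort the pairs and write them back out as CSV. This is plain Python with the standard `csv` module, not a JsonTools feature, and `date_key` and `pairs_to_csv` are hypothetical names.

```python
import csv
import io


def date_key(date):
    """Turn 'Y.M.D' into a tuple of ints so dates sort numerically."""
    return tuple(int(part) for part in date.split("."))


def pairs_to_csv(pairs):
    """Render (date, country) pairs as CSV text in chronological order."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["date", "country"])
    for date, country in sorted(pairs, key=lambda pair: date_key(pair[0])):
        writer.writerow([date, country])
    return buffer.getvalue()
```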
-
@Mark-Olson said in Wanting to extract specific info from several text documents:
With a little preprocessing, my JsonTools plugin can handle this.
Wouldn’t you have to do it individually for each of the “couple thousand files”? Or can the JsonTools plugin run the same search/replace across “all open files” or “all files in folder” akin to “Find in Files”?
Because my native regex search/replace version can work in “Find in Files”, so it would do each of the steps across the thousands of files without any user interaction (and the files don’t even have to be open in Notepad++).
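For comparison, and only as a sketch outside Notepad++, the same folder-wide pass could be scripted in Python so that the originals are never modified. The function name `extract_folder`, the `*.txt` pattern and the `latin-1` encoding are assumptions for illustration; nothing in the thread states the real extension or encoding. The pattern is a Python approximation of the regex above, with `[ \t]` in place of `\h`.

```python
import re
from pathlib import Path

# Python approximation of the dated-owner regex; [^}]*? may span line breaks.
DATED_OWNER = re.compile(
    r"^[ \t]*(\d+\.\d+\.\d+)[ \t]*=[ \t]*\{[^}]*?\bowner[ \t]*=[ \t]*(\w+)",
    re.MULTILINE,
)


def extract_folder(src_dir, dst_dir, pattern="*.txt", encoding="latin-1"):
    """Write one 'date owner' listing per source file into dst_dir.

    Source files are only read, never changed. Returns the number of
    files processed.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    processed = 0
    for src in sorted(src_dir.glob(pattern)):
        text = src.read_text(encoding=encoding)
        lines = [f"{date} {tag}" for date, tag in DATED_OWNER.findall(text)]
        out_text = "".join(line + "\n" for line in lines)
        (dst_dir / src.name).write_text(out_text, encoding=encoding)
        processed += 1
    return processed
```

Writing to a separate output folder plays the same role as the "work on copies" advice: if a pattern turns out to miss an edge case, the run can simply be repeated.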
-
@PeterJones
Oh, I missed that part. Yeah, that makes my cute solution impractical. The JSON-from-files-and-APIs form (which I just call the grepper form) can run the query on thousands of files, but:
- it wouldn’t actually replace the contents of the file
- it would only output JSON anyway
- You’d have to do the initial preprocessing beforehand, at which point you’re already using some kind of grep tool so why bother using JsonTools
Nobody has ever asked me to implement a feature that could do that, and I’ve never seen the need.
That said, if for whatever reason people want to see how to solve this using a combination of JsonTools and PythonScript, I can show you.
-
@PeterJones
Fantastic! That is exactly what I have been wanting! Thank you very much!