
    Wanting to extract specific info from several text documents

    9 Posts 4 Posters 1.1k Views
    • matto.whM
      matto.wh
      last edited by

      Greetings,
      I have several text documents (saved data from a game) that I would like to extract specific information from. From these files I want to extract all of the 3-letter tags that follow the phrase "owner = ", as well as the date that corresponds to each tag. For instance, in this example:

      #140 - Bosnia
      
      owner = ROM
      controller = ROM
      culture = illyrian
      religion = druidism
      capital = "Servitium"
      trade_goods = salt
      hre = no
      discovered_by = ottoman
      discovered_by = middle_eastern
      discovered_by = muslim
      discovered_by = roman_group
      discovered_by = eastern
      discovered_by = barbarian
      discovered_by = western
      base_tax = 2
      base_production = 2
      base_manpower = 1
      is_city = yes
      add_core = ROM
      
      395.1.17 = { controller = ROW owner = ROW add_core = ROW remove_core = ROM } # Final division of the empire
      450.1.1 = { religion = arianism }
      455.1.1 = { owner = OST controller = OST add_core = OST remove_core = ROW culture = gothic }
      540.1.1 = { owner = BYZ controller = BYZ add_core = BYZ remove_core = OST }
      610.1.1 = { add_core = BOS culture = bosnian religion = slavic }
      630.1.1 = { owner = BOS controller = BOS remove_core = BYZ capital = "Doboj" }
      869.1.1 = { religion = orthodox }
      880.1.1 = { owner = SER controller = SER add_core = SER }
      960.1.1 = { owner = CRO controller = CRO add_core = CRO remove_core = SER }
      997.1.1 = { owner = BOS controller = BOS add_core = BOS }
      1000.1.1 = { base_tax = 3 base_production = 3 }
      1050.1.1 = { owner = CRO controller = CRO add_core = CRO }
      1077.1.1 = { owner = DOC controller = DOC add_core = DOC remove_core = CRO }
      1095.1.1 = { owner = SER controller = SER add_core = SER remove_core = DOC }
      1136.1.1 = { owner = HUN controller = HUN add_core = HUN remove_core = SER }
      1165.1.1 = { owner = BYZ controller = BYZ add_core = BYZ remove_core = HUN }
      1180.1.1 = { owner = BOS controller = BOS add_core = BOS remove_core = BYZ }
      1305.1.1 = { owner = CRO controller = CRO add_core = CRO }
      1322.1.1 = { owner = BOS controller = BOS remove_core = CRO }
      
      1463.1.1 = {
      	owner = TUR
      	controller = TUR
      	add_core = TUR
      } # The Ottoman province of Bosnia
      1583.1.1 = { fort_16th = yes }
      1593.1.1 = { unrest = 3 } # Fighting began in northwestern Bosnia, sparked Habsburg-Ottoman conflict
      1606.1.1 = { unrest = 0 } # Temporarty peace
      1683.1.1 = { unrest = 6 } # Heavy fighting & destruction in western Bosnia
      1699.1.1 = { unrest = 0 } # Flood of Muslim refugees from Slavonia & Ottoman Hungary 
      1716.12.9 = { controller = HAB } # Occupied by Habsburg
      1718.7.21 = { controller = TUR }
      1737.7.1 = { controller = HAB } # Occupied by Habsburg again
      1738.1.1 = { unrest = 5 } # The constant fighting, increased taxation caused tax revolts
      1739.9.18 = { controller = TUR } # Treaty of Belgrade, Habsburg gave up its claim to the territory
      1740.1.1 = { unrest = 8 }
      1750.1.1 = { unrest = 0 }
      1788.12.6 = { controller = HAB } # Habsburg invasion
      1791.8.4 = { controller = TUR } # Treaty of Sistova
      
      1875.11.2  = { revolt = { type = nationalist_rebels size = 1 } controller = REB add_core = SER }
      1877.8.4   = { revolt = {} controller = TUR }
      1878.7.13  = { owner = HAB controller = HAB add_core = HAB }
      1908.10.5  = { remove_core = TUR }
      1910.1.1 = { discovered_by = asian }
      1918.12.1  = { owner = YUG controller = YUG add_core = YUG add_core = BHE remove_core = HAB remove_core = BOS }
      1941.4.6   = { owner = GER controller = GER }
      1941.4.10  = { owner = CRO controller = CRO add_core = CRO }
      1945.5.8   = { owner = YUG controller = YUG remove_core = CRO }
      1992.3.3   = { owner = BHE controller = BHE remove_core = YUG remove_core = SER }
      
      

      I would then want to extract the following:

      395.1.17 ROW
      455.1.1 OST
      540.1.1 BYZ
      630.1.1 BOS
      880.1.1 SER
      960.1.1 CRO
      997.1.1 BOS
      1050.1.1 CRO
      1077.1.1 DOC
      1095.1.1 SER
      1136.1.1 HUN
      1165.1.1 BYZ
      1180.1.1 BOS
      1305.1.1 CRO
      1322.1.1 BOS
      1463.1.1 TUR
      1878.7.13 HAB
      1918.12.1 YUG
      1941.4.6 GER
      1941.4.10 CRO
      1945.5.8 YUG
      1992.3.3 BHE
      

      All I am wanting is just the “owner” tags and the dates associated with them. As one can see, the format of the document tends to be along the lines of Y.M.D = {}.
      One might notice that there is also an unbound "owner = " at the top of the document. Not every one of the text documents has one of these, so it can either be ignored, added to the output listing without a date, or, if possible, added with “2.1.1” as a placeholder date.
      I hope it is clear what I want to extract from these files. If it is not, please let me know what further clarification I can give.

      Thanks!

      • PeterJonesP
        PeterJones @matto.wh
        last edited by

        @Matthew-Wheaton said in Wanting to extract specific info from several text documents:

        All I am wanting

        Is that all? ;-)

        1. copy all text to a new file (or copy all the files to a new directory) and edit the new copies, because we’re going to be deleting data, and we don’t want it lost from the original (especially if something goes wrong)
        2. format all entries (single-line or multiline) that have both the date and the owner
          FIND = ^(\d+\.\d+\.\d+)\h*=\h*{[^}]*owner\h*=\h*(\w+)[^}]*}(?-s:.*)$
          REPLACE = ☺$1 $2
          SEARCH MODE = regular expression
          REPLACE ALL
        3. Delete any line that doesn’t start with a smiley
          FIND = ^(?!☺).*(\R|\Z)
          REPLACE = empty field
          SEARCH MODE = regular expression
          REPLACE ALL
        4. Delete the smileys:
          FIND = ^☺
          REPLACE = empty field
          SEARCH MODE = regular expression
          REPLACE ALL
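For anyone who would rather script this than use the Replace dialog, the four steps above can be approximated in Python. This is only a sketch under assumptions: the function name `extract_owner_changes` is made up for illustration, the pattern is a translation of the FIND expression into Python's `re` syntax (which has no `\h`, so `[ \t]` is used), and a `\b` was added so that a key merely ending in "owner" would not match. Like the original regex, it stops at the first `}`, so an owner that comes after a nested `{ ... }` block inside the same date would be missed.

```python
import re

# Date, "= {", then anything up to the first "}" that contains "owner = TAG".
# [^}] also matches newlines, so multi-line blocks are covered.
PAIR = re.compile(
    r"^[ \t]*(\d+\.\d+\.\d+)[ \t]*=[ \t]*\{[^}]*?\bowner[ \t]*=[ \t]*(\w+)",
    re.MULTILINE,
)

def extract_owner_changes(text):
    """Return (date, tag) pairs for every dated block that sets an owner."""
    return PAIR.findall(text)

sample = """owner = ROM
395.1.17 = { controller = ROW owner = ROW add_core = ROW remove_core = ROM } # Final division
450.1.1 = { religion = arianism }
1463.1.1 = {
	owner = TUR
	controller = TUR
}
1716.12.9 = { controller = HAB }
"""

for date, tag in extract_owner_changes(sample):
    print(date, tag)
```

Because this collects the matches directly, there is no need for the marker character or the delete-everything-else passes; the original files are only read, never modified.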

        The steps above ignore the owner that isn’t inside {}s.
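If the unbound owner is wanted after all, one possible way to pick it up is to track brace depth and only accept an "owner = " line that sits outside every { } block. The sketch below assumes that `#` always starts a comment (as it appears to in the example) and uses the “2.1.1” placeholder date suggested in the question; the name `unbound_owner` is hypothetical.

```python
import re

# An "owner = TAG" that starts a line; whether it is top-level is decided by depth.
TOP_OWNER = re.compile(r"^\s*owner\s*=\s*(\w+)")

def unbound_owner(text, placeholder="2.1.1"):
    """Return (placeholder, tag) for the first owner outside any braces, else None."""
    depth = 0
    for line in text.splitlines():
        code = line.split("#", 1)[0]  # assumption: "#" starts a comment
        if depth == 0:
            m = TOP_OWNER.match(code)
            if m:
                return (placeholder, m.group(1))
        # Update nesting after the check so "date = {" opens a block for later lines.
        depth += code.count("{") - code.count("}")
    return None

print(unbound_owner("#140 - Bosnia\nowner = ROM\ncontroller = ROM\n"))
```

A file that has no top-level owner simply yields `None`, so the caller can decide whether to skip it or leave the entry out of the listing.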

        There are probably formatting exceptions that my regex won’t handle. This is a good starting point; anything beyond this, I’m going to leave as an exercise to the reader.

        ----

        Useful References

        • Please Read Before Posting
        • Template for Search/Replace Questions
        • Formatting Forum Posts
        • Notepad++ Online User Manual: Searching/Regex
        • FAQ: Where to find other regular expressions (regex) documentation

        ----

        Please note: This Community Forum is not a data transformation service; you should not expect to be able to always say “I have data like X and want it to look like Y” and have us do all the work for you. If you are new to the Forum, and new to regular expressions, we will often give help on the first one or two data-transformation questions, especially if they are well-asked and you show a willingness to learn; and we will point you to the documentation where you can learn how to do the data transformations for yourself in the future. But if you repeatedly ask us to do your work for you, you will find that the patience of usually-helpful Community members wears thin. The best way to learn regular expressions is by experimenting with them yourself, and getting a feel for how they work; having us spoon-feed you the answers without you putting in the effort doesn’t help you in the long term and is uninteresting and annoying for us.

        • Terry RT
          Terry R @matto.wh
          last edited by

          @matto-wh said in Wanting to extract specific info from several text documents:

          please let me know how I can give more clarification on what I am trying to get here if it is not clear.

          A question: almost all the lines are of the type “date” followed by information within the { and } on 1 line, but one seems to be on multiple lines. This example hasn’t been altered at all, has it? It is important for anyone who wishes to help you to be able to trust what you show.

          Also, how many files do you need to run the extraction on?

          Having said that, it would seem that a method of getting the data might be as follows (this would use copies of the files, since we are destroying data within them):

          1. remove all line feeds (if sets can appear on multiple lines)
          2. arrange the data so each line contains 1 “date” and “owner =” set
          3. remove all other text
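The three steps above could be sketched in Python roughly as follows. This is an illustration of the idea, not a tested solution for every file: the function name `owner_lines` is invented, comments starting with `#` are assumed to be safe to discard before the lines are joined, and an owner that follows a nested `{ ... }` block inside the same date would still be missed.

```python
import re

def owner_lines(text):
    """Flatten the text, put one dated set per line, keep only 'date TAG'."""
    # Drop "#" comments first so joining lines cannot pull comment text into the data.
    no_comments = re.sub(r"#[^\n]*", "", text)
    # Step 1: remove all line feeds (every whitespace run becomes one space).
    flat = re.sub(r"\s+", " ", no_comments)
    # Step 2: start a new line in front of every "date = {".
    one_per_line = re.sub(r"\s*\b(\d+\.\d+\.\d+ ?= ?\{)", r"\n\1", flat)
    # Step 3: remove all other text, keeping the date and the owner tag.
    out = []
    for line in one_per_line.splitlines():
        m = re.match(r"(\d+\.\d+\.\d+) ?= ?\{[^}]*?\bowner ?= ?(\w+)", line)
        if m:
            out.append(f"{m.group(1)} {m.group(2)}")
    return out

sample = """owner = ROM
395.1.17 = { controller = ROW owner = ROW } # Final division
450.1.1 = { religion = arianism }
1463.1.1 = {
	owner = TUR
	controller = TUR
} # The Ottoman province
"""

print("\n".join(owner_lines(sample)))
```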

          Note that this is just a high-level description. That’s how I operate: I think through the solution in my home language first, and then the coding (in whatever computer language is required) can be more easily accomplished.

          Terry

          PS: I was just about to post when I saw @PeterJones’s post, which is very similar to my concept.

          • matto.whM
            matto.wh @Terry R
            last edited by matto.wh

            @Terry-R

            This example hasn’t been altered at all has it?

            No, I have not altered the example at all. Part of the issue I have is that in some of the documents there are instances like this one, where a set spans multiple lines or where the "owner = " is not the first item listed.

            Also, how many files do you need to run the extraction on?

            It’s a couple thousand files. Hence my wanting to automate it in some capacity. The “date” and “owner” sets are the data I am trying to extract from these files. Destroying all the other text in the (copies of the) files aside from the dates and owner pairings would likely be the most effective means of doing this I think. I am not very familiar with programming/coding or even Notepad++, so I would like to know how exactly I would go about making it do such a thing?

            Thanks

            • Terry RT
              Terry R @matto.wh
              last edited by

              @matto-wh said in Wanting to extract specific info from several text documents:

              so I would like to know how exactly I would go about making it do such a thing

              Well, @PeterJones’s solution (which is coded as a regular expression (regex) and ready to try) would be a good starting point.

              Often when we provide such a regex, we can easily code for the majority of situations, but we sometimes aren’t aware of edge cases. The one in your example, where the “set” is on multiple lines, is one such case, although since it is shown, it can be catered for.

              Do you have other means to verify the completeness of the data extracted? It’s easy to provide regex but we have to leave it to the requester to verify that it meets their needs.

              You mentioned over 1000 files to be consumed. I wouldn’t suggest running a provided solution on that many until you have tested it on 1 file that you have independently solved by other means, and confirmed that the regex gives the same result.
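One cheap independent check, sketched here in Python, is to count every "owner =" in a file and compare that with the number of date/owner pairs the pattern actually captured. The names `DATED` and `coverage` are made up for this example, `#` is assumed to start a comment, and the pattern is a Python approximation of the regex posted earlier in the thread rather than the regex itself.

```python
import re

DATED = re.compile(
    r"^[ \t]*(\d+\.\d+\.\d+)[ \t]*=[ \t]*\{[^}]*?\bowner[ \t]*=[ \t]*(\w+)",
    re.MULTILINE,
)

def coverage(text):
    """Return (total 'owner =' occurrences, number captured together with a date)."""
    body = re.sub(r"#[^\n]*", "", text)  # ignore comments
    total = len(re.findall(r"\bowner\s*=", body))
    extracted = len(DATED.findall(body))
    return total, extracted

sample = """owner = ROM
395.1.17 = { controller = ROW owner = ROW }
1463.1.1 = {
	owner = TUR
}
1875.1.1 = { revolt = { size = 1 } owner = REB }
"""

total, extracted = coverage(sample)
print(total, extracted)
```

A difference of one is expected when the file has an unbound owner at the top; anything larger, as in this sample where the nested braces hide an owner, marks a file worth inspecting by hand.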

              So give the provided solution a go. Very possibly (as @PeterJones mentioned) you may need to return to post again in this thread and show other edge cases you weren’t aware of. From that someone could further refine the solution until you were sure of the result.

              Terry

              • Mark OlsonM
                Mark Olson
                last edited by

                With a little preprocessing, my JsonTools plugin can handle this.

                I think that PeterJones’ solution is fine, and with a little tweaking it can handle weirder formatting, so don’t try this unless you get desperate. If your file is so pathological that JsonTools can’t help you, IDK, maybe try PythonScript.

                1. Add a { (open curly brace) as the first character of the file
                2. Find/replace [\w\.]+ with "$0" (regular expressions on)
                3. Find/replace = with :
                4. Use JsonTools to pretty-print the file, since it is now reasonably close to syntactically valid JSON
                5. Open up the JSON tree view.
                6. Run the following query: items(@.g`^[\\d\\.]+$`.owner)[:]{date:@[0], country:@[1]}
                7. Use the JSON-to-CSV form to dump the result of the query into a CSV file.
                8. The result isn’t exactly what you wanted, but it’s close enough, and it’s in CSV format, which is intrinsically useful.
                country,date
                CRO,1050.1.1
                DOC,1077.1.1
                SER,1095.1.1
                HUN,1136.1.1
                BYZ,1165.1.1
                BOS,1180.1.1
                CRO,1305.1.1
                BOS,1322.1.1
                TUR,1463.1.1
                HAB,1878.7.13
                YUG,1918.12.1
                CRO,1941.4.10
                GER,1941.4.6
                YUG,1945.5.8
                BHE,1992.3.3
                ROW,395.1.17
                OST,455.1.1
                BYZ,540.1.1
                BOS,630.1.1
                SER,880.1.1
                CRO,960.1.1
                BOS,997.1.1
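The rows above appear to be sorted as text, which is why 1050.1.1 comes before 395.1.17 and 1941.4.10 before 1941.4.6. If chronological order matters, a small Python sketch like the one below could re-sort the pairs by comparing year, month and day as numbers. The helper names `date_key` and `to_csv` are hypothetical, and the input is assumed to be (date, country) string pairs.

```python
import csv
import io

def date_key(date):
    # "1941.4.10" -> (1941, 4, 10), so each part is compared as a number.
    return tuple(int(part) for part in date.split("."))

def to_csv(pairs):
    """Return CSV text with a header row and the pairs in chronological order."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["date", "country"])
    for date, country in sorted(pairs, key=lambda p: date_key(p[0])):
        writer.writerow([date, country])
    return buf.getvalue()

rows = [("1050.1.1", "CRO"), ("1941.4.10", "CRO"), ("1941.4.6", "GER"), ("395.1.17", "ROW")]
print(to_csv(rows))
```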
                
                
                • PeterJonesP
                  PeterJones @Mark Olson
                  last edited by

                  @Mark-Olson said in Wanting to extract specific info from several text documents:

                  With a little preprocessing, my JsonTools plugin can handle this.

                  Wouldn’t you have to do it individually for each of the “couple thousand files”? Or can the JsonTools plugin run the same search/replace across “all open files” or “all files in folder” akin to “Find in Files”?

                  Because my native regex search/replace version can work in “Find in Files”, it would do each of the steps across the thousands of files without any user interaction (and the files don’t even have to be open in Notepad++).
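The same "every file in a folder" idea can also be sketched outside Notepad++ with a short Python script. Everything specific here is an assumption: the function name `extract_folder`, the `*.txt` file pattern, and UTF-8 as the encoding (undecodable bytes are replaced rather than raising an error). The pattern is again a Python approximation of the posted regex. Results go to a separate folder, so the original files are never overwritten.

```python
import re
from pathlib import Path

PAIR = re.compile(
    r"^[ \t]*(\d+\.\d+\.\d+)[ \t]*=[ \t]*\{[^}]*?\bowner[ \t]*=[ \t]*(\w+)",
    re.MULTILINE,
)

def extract_folder(src_dir, dst_dir, pattern="*.txt", encoding="utf-8"):
    """Write one 'date TAG' listing per input file into dst_dir; return the file count."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = 0
    for path in sorted(src.glob(pattern)):
        text = path.read_text(encoding=encoding, errors="replace")
        lines = [f"{date} {tag}" for date, tag in PAIR.findall(text)]
        # Same file name, different folder: the source file stays untouched.
        (dst / path.name).write_text("".join(line + "\n" for line in lines), encoding=encoding)
        written += 1
    return written

# Example call (folder names are placeholders):
# extract_folder("provinces_copy", "owners_out")
```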

                  • Mark OlsonM
                    Mark Olson @PeterJones
                    last edited by Mark Olson

                    @PeterJones
                    Oh, I missed that part. Yeah, that makes my cute solution impractical.

                    The JSON-from-files-and-APIs form (which I just call the grepper form) can run the query on thousands of files, but:

                    1. it wouldn’t actually replace the contents of the file
                    2. it would only output JSON anyway
                    3. You’d have to do the initial preprocessing beforehand, at which point you’re already using some kind of grep tool so why bother using JsonTools

                    Nobody has ever asked me to implement a feature that could do that, and I’ve never seen the need.

                    That said, if for whatever reason people want to see how to solve this using a combination of JsonTools and PythonScript, I can show you.

                    • matto.whM
                      matto.wh @PeterJones
                      last edited by

                      @PeterJones
                      Fantastic! That is exactly what I have been wanting! Thank you very much!

                      The Community of users of the Notepad++ text editor.