Community
    • Login

    Proposed FAQ explaining why you shouldn't parse JSON with regex

    Scheduled Pinned Locked Moved General Discussion
    jsonregex
    16 Posts 7 Posters 5.9k Views
    Loading More Posts
    • Oldest to Newest
    • Newest to Oldest
    • Most Votes
    Reply
    • Reply as topic
    Log in to reply
    This topic has been deleted. Only users with topic management privileges can see it.
    • PeterJonesP
      PeterJones @Mark Olson
      last edited by PeterJones

      @Mark-Olson said,

      Great!

      Here’s a gist with the FAQ in it.

      FAQ entry created.

      I added a brief corollary (I hope you don’t mind).

      update: I have subscribed to that gist, so when you update it, I believe GitHub will notify me automatically; if you make a change to the gist that doesn’t get reflected here within a day or so, you might want to ping me in Chat to make sure I noticed it.

      1 Reply Last reply Reply Quote 0
      • mkupperM
        mkupper @Mark Olson
        last edited by

        Many of the reasons for not using regex to parse JSON also apply to HTML. I don’t know if it would help or hurt to generalize the FAQ entry to include HTML.

        One other item for why it’s bad is that Notepad++'s regex is not as graceful or agile at processing Unicode characters U+10000 to U+10FFFF. This region used to be obscure, pretty much never used, or seen, stuff but with the introduction and then expansion of U+1F300 to U+1F5FF Miscellaneous Symbols and Pictographs plus U+1F600 to U+1F64F emoticons these characters are now regularly showing up in HTML and JSON.

        PeterJonesP Mark OlsonM 2 Replies Last reply Reply Quote 1
        • PeterJonesP
          PeterJones @mkupper
          last edited by

          @mkupper said in Proposed FAQ explaining why you shouldn't parse JSON with regex:

          Many of the reasons for not using regex to parse JSON also apply to HTML

          Hence my corollary. 😉

          1 Reply Last reply Reply Quote 1
          • ThosRTannerT
            ThosRTanner
            last edited by

            For what it is worth JSON is an acronym for JavaScript Object Notation. If you have programmed a bit in Javascript, it’s quite easy to understand JSON.

            As to why you shouldn’t parse HTML with regex: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#1732454

            1 Reply Last reply Reply Quote 2
            • Mark OlsonM
              Mark Olson @mkupper
              last edited by

              @mkupper said in Proposed FAQ explaining why you shouldn't parse JSON with regex:

              Many of the reasons for not using regex to parse JSON also apply to HTML

              Strongly agree. For the lulz, I also recommend changing the corollary post to link to this StackOverflow post, which is easily my favorite help forum post of all time.

              Wouldn’t hurt to mention the XMLTools plugin in the corollary either.

              PeterJonesP 1 Reply Last reply Reply Quote 2
              • PeterJonesP
                PeterJones @Mark Olson
                last edited by

                @Mark-Olson,

                Strongly agree. For the lulz, I also recommend changing the corollary post to link to this StackOverflow post, which is easily my favorite help forum post of all time.

                Wouldn’t hurt to mention the XMLTools plugin in the corollary either.

                Done and done. (Like @ThosRTanner , I actually linked to the reply rather than the question on SO, because it’s the important part for the reader to be directed to.)

                1 Reply Last reply Reply Quote 2
                • Terry RT
                  Terry R @Mark Olson
                  last edited by

                  @Mark-Olson said in Proposed FAQ explaining why you shouldn't parse JSON with regex:

                  It seems like there are a lot of people who want to use regular expressions to parse JSON. This is a dangerous trap for reasons that I will explain below

                  Following on from the recent post here regarding updating one JSON file with a particular field from another (so a partial merge process) and the resulting question about whether regex was appropriate or a Python (or other programming language) was more appropriate I went in search of an answer that satisfied me, I still haven’t found it. Lots of posts about it “being a bad idea” but I cannot find any substance to back it up. Maybe we’ve been lucky on this forum that the JSON queries are from simply structured JSON files which are likely to come to no harm from using regex, or maybe the OP goes away in disgust at us having wrecked his data (all solutions come with a caveat!), never to darken our (forum) door again.

                  But what I did find was a number of apps (modules) that have already been created and they are designed to work across many JSON structures. Unfortunately it will still require those who wish to use them a learning curve so they can utilise these apps. Possibly that learning curve is still flatter than learning a new programming language.

                  An interesting one was this on github, called jsonmerge. It’s a Python module. The licence states it is free for anyone to copy, modify etc etc. Maybe @Mark-Olson might be interested since he’s the author of JSONTools for NPP. Could we possibly see JSONTools enhanced in a new version by incorporating this? I think it is worthy of a look.

                  Of course there are several other free standing apps, any one of which may also be better than dealing with JSON inside of NPP, but that’s not for this forum to ponder.

                  food for thought
                  Terry

                  Mark OlsonM CoisesC 2 Replies Last reply Reply Quote 0
                  • Mark OlsonM
                    Mark Olson @Terry R
                    last edited by

                    @Terry-R

                    TL;DR: as the FAQ says, if you don’t want to learn a scripting language, you are welcome to ignore the FAQ. Just remember that any approach that doesn’t involve a JSON parser will almost certainly take dozens of times longer and more error-prone than a solution that uses a JSON parser.

                    Every minute wasted debugging a JSON-parsing regex is a minute that could have been spent learning a scripting language. The difference is that a JSON-parsing regex will not generalize from one task to another, but knowledge of a scripting language can be applied to all manner of tasks.

                    As I hope the FAQ makes clear, I don’t dislike or feel superior to people who want to parse JSON with regex. I just wish they would try a different approach and stop wasting their time. Lord knows I have wasted plenty of time myself (probably more than most other people in this forum) for similar reasons.

                    Could we possibly see JSONTools enhanced in a new version by incorporating this [JsonMerge]? I think it is worthy of a look.

                    No, but thanks for the suggestion. JsonTools was implemented in C#, and JsonMerge is Python, so I would have to reimplement all the algorithms in JsonMerge, and I am not a capable enough programmer to do this well.

                    Maybe we’ve been lucky on this forum that the JSON queries are from simply structured JSON files which are likely to come to no harm from using regex

                    Yes, exactly. The only reason your regex solutions “work” is that you are getting away with assuming the format is constant. As I note in the FAQ, every regex solution, including many that are so nightmarishly complicated by corner cases as to be impossible to debug, will most likely fail if the format changes (e.g., pretty-printed, keys of an object sorted).

                    1 Reply Last reply Reply Quote 1
                    • CoisesC
                      Coises @Terry R
                      last edited by

                      @Terry-R said in Proposed FAQ explaining why you shouldn't parse JSON with regex:

                      But what I did find was a number of apps (modules) that have already been created and they are designed to work across many JSON structures. Unfortunately it will still require those who wish to use them a learning curve so they can utilise these apps. Possibly that learning curve is still flatter than learning a new programming language.

                      Yes, there seems to be a no-win situation here for this sort of problem:

                      • I’m working on something that we planned to deploy quickly, but I’m blocked because I need to change this JSON file.
                      • The change is conceptually simple (at least, it appears so in my mind), but it happens in dozens of places throughout the file, so doing it manually would be tedious, time-consuming and error-prone.
                      • I can’t even remember the last time I had to do anything much with a JSON file, and I don’t expect to run into this again anytime soon.
                      • Did I mention we want this sooner rather than later?

                      For this person, taking time — a few days? a week? more than that? — to learn a new programming language and then debug a script written in that language is absurd. Getting hung up on a long detour for something that is conceptually simple (“I just need to change all the device names of smart light bulbs from xxx to bulb-xxx!”) would be unreasonable. (Even if you were willing to take the time to learn for future use, you still have this problem that needs a solution now.)

                      It could be that the RemesPath language in JsonTools is quick enough to grasp that it’s practical to learn that even for a one-off problem. (I confess I haven’t yet attempted to understand it.) But I suspect most people in this situation would just hack away with a regex and hope for the best.

                      It’s too bad there isn’t a low-learning-curve solution for applying regex-style logic to structured data like JSON, XML, HTML, etc.; but based on these discussions, I’m led to believe it doesn’t exist.

                      Mark OlsonM 1 Reply Last reply Reply Quote 1
                      • Mark OlsonM
                        Mark Olson @Coises
                        last edited by Mark Olson

                        @Coises said in Proposed FAQ explaining why you shouldn't parse JSON with regex:

                        It could be that the RemesPath language in JsonTools is quick enough to grasp that it’s practical to learn that even for a one-off problem.

                        Well… sort of. RemesPath probably could be a lot easier to grasp, and I think it is currently held back by confusing error messages.

                        JsonTools has a find/replace form, which essentially translates questions along the lines of I want to find all keys in this JSON that mach the regular expression `.*name` into RemesPath queries.

                        This essentially aims to provide training wheels for learning RemesPath, similar to how the GUI in Microsoft Power Query shows the user the M-script it emits.

                        I guess an example would be, suppose the user wants to find all the values that start with a word containing the substring AG (case-insensitive) in the name field of any object.

                        The workflow with the find/replace form might go as follows:

                        1. the user wants to recursively search for all keys that match name exactly, so they:
                          • check the Show advanced options->Match exactly? box
                          • leaves Recursive search? checked.
                          • leaves Ignore case? unchecked
                          • select the Keys option from the Search in keys or values? dropdown
                          • enter name in the Find... box.
                        2. The query emitted by the form says (@)..`name` , so:
                          • the user looks at the treeview (which will snap to the location of each search result) to see if it found the right values.
                          • the user copies ..`name` into the Root field of the find/replace form
                          • puts ^\w*AG into the Find... field
                          • changes the Search in keys or values? dropdown to Values
                          • checks Show advanced options->Ignore case?
                          • unchecks Show advanced options->Match exactly?
                          • checks * Show advanced options->Use regular expressions?

                        and this emits the query (@..`name`)..*[str(@) =~ g`(?i)^\w*AG`], which will find the desired things.

                        b283942b-d3a9-430f-941f-42599ea4db83-image.png

                        1 Reply Last reply Reply Quote 2
                        • mpheathM
                          mpheath @Mark Olson
                          last edited by

                          https://xkcd.com/1171/
                          8992aaa5-536d-40a3-bb79-ea959b8934ec-Untitled.png

                          @Mark-Olson I prefer this quote from 12 Aug 1997:

                          http://regex.info/blog/2006-09-15/247

                          Some people, when confronted with a problem, think
                          “I know, I’ll use regular expressions.” Now they have two problems.

                          1 Reply Last reply Reply Quote 1
                          • First post
                            Last post
                          The Community of users of the Notepad++ text editor.
                          Powered by NodeBB | Contributors