Proposed FAQ explaining why you shouldn't parse JSON with regex
-
It’s looking good to me.
When you’re ready, you could create a new post in the Blogs section, and then ask me here to move it into the FAQ section for you. Or you could just reply here that you want it as it is in your most-recent post, and I can copy that and create the new Topic in the FAQ section and change the owner to you.
As far as maintaining it goes, you’ll only have edit permission for the same few hours that are allowed with normal posts. So I think the best would be for you to set up a gist on GitHub, or put it in a folder of some other GitHub repo that you control (or any other preferred publically-facing repo system), and then when you update that you could direct-message me, and I could take the contents from there and update the FAQ. Or, if you wanted, I could add a file to the same GitHub repo/directory where I store most of the other active FAQ content – and if you wanted to update the contents, you could do a PR for that file (and me handling the PR would obviously also prompt me to merge the contents into the actual FAQ).
So let me know when you’re ready for it to be added to the FAQ section
-
-
@Mark-Olson said,
Great!
FAQ entry created.
I added a brief corollary (I hope you don’t mind).
update: I have subscribed to that gist, so when you update it, I believe GitHub will notify me automatically; if you make a change to the gist that doesn’t get reflected here within a day or so, you might want to ping me in Chat to make sure I noticed it.
-
Many of the reasons for not using regex to parse JSON also apply to HTML. I don’t know if it would help or hurt to generalize the FAQ entry to include HTML.
One other item for why it’s bad is that Notepad++'s regex is not as graceful or agile at processing Unicode characters U+10000 to U+10FFFF. This region used to be obscure, pretty much never used, or seen, stuff but with the introduction and then expansion of U+1F300 to U+1F5FF Miscellaneous Symbols and Pictographs plus U+1F600 to U+1F64F emoticons these characters are now regularly showing up in HTML and JSON.
-
@mkupper said in Proposed FAQ explaining why you shouldn't parse JSON with regex:
Many of the reasons for not using regex to parse JSON also apply to HTML
Hence my corollary. 😉
-
For what it is worth JSON is an acronym for JavaScript Object Notation. If you have programmed a bit in Javascript, it’s quite easy to understand JSON.
As to why you shouldn’t parse HTML with regex: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#1732454
-
@mkupper said in Proposed FAQ explaining why you shouldn't parse JSON with regex:
Many of the reasons for not using regex to parse JSON also apply to HTML
Strongly agree. For the lulz, I also recommend changing the corollary post to link to this StackOverflow post, which is easily my favorite help forum post of all time.
Wouldn’t hurt to mention the XMLTools plugin in the corollary either.
-
Strongly agree. For the lulz, I also recommend changing the corollary post to link to this StackOverflow post, which is easily my favorite help forum post of all time.
Wouldn’t hurt to mention the XMLTools plugin in the corollary either.
Done and done. (Like @ThosRTanner , I actually linked to the reply rather than the question on SO, because it’s the important part for the reader to be directed to.)
-
@Mark-Olson said in Proposed FAQ explaining why you shouldn't parse JSON with regex:
It seems like there are a lot of people who want to use regular expressions to parse JSON. This is a dangerous trap for reasons that I will explain below
Following on from the recent post here regarding updating one JSON file with a particular field from another (so a partial merge process) and the resulting question about whether regex was appropriate or a Python (or other programming language) was more appropriate I went in search of an answer that satisfied me, I still haven’t found it. Lots of posts about it “being a bad idea” but I cannot find any substance to back it up. Maybe we’ve been lucky on this forum that the JSON queries are from simply structured JSON files which are likely to come to no harm from using regex, or maybe the OP goes away in disgust at us having wrecked his data (all solutions come with a caveat!), never to darken our (forum) door again.
But what I did find was a number of apps (modules) that have already been created and they are designed to work across many JSON structures. Unfortunately it will still require those who wish to use them a learning curve so they can utilise these apps. Possibly that learning curve is still flatter than learning a new programming language.
An interesting one was this on github, called jsonmerge. It’s a Python module. The licence states it is free for anyone to copy, modify etc etc. Maybe @Mark-Olson might be interested since he’s the author of JSONTools for NPP. Could we possibly see JSONTools enhanced in a new version by incorporating this? I think it is worthy of a look.
Of course there are several other free standing apps, any one of which may also be better than dealing with JSON inside of NPP, but that’s not for this forum to ponder.
food for thought
Terry -
TL;DR: as the FAQ says, if you don’t want to learn a scripting language, you are welcome to ignore the FAQ. Just remember that any approach that doesn’t involve a JSON parser will almost certainly take dozens of times longer and more error-prone than a solution that uses a JSON parser.
Every minute wasted debugging a JSON-parsing regex is a minute that could have been spent learning a scripting language. The difference is that a JSON-parsing regex will not generalize from one task to another, but knowledge of a scripting language can be applied to all manner of tasks.
As I hope the FAQ makes clear, I don’t dislike or feel superior to people who want to parse JSON with regex. I just wish they would try a different approach and stop wasting their time. Lord knows I have wasted plenty of time myself (probably more than most other people in this forum) for similar reasons.
Could we possibly see JSONTools enhanced in a new version by incorporating this [JsonMerge]? I think it is worthy of a look.
No, but thanks for the suggestion. JsonTools was implemented in C#, and JsonMerge is Python, so I would have to reimplement all the algorithms in JsonMerge, and I am not a capable enough programmer to do this well.
Maybe we’ve been lucky on this forum that the JSON queries are from simply structured JSON files which are likely to come to no harm from using regex
Yes, exactly. The only reason your regex solutions “work” is that you are getting away with assuming the format is constant. As I note in the FAQ, every regex solution, including many that are so nightmarishly complicated by corner cases as to be impossible to debug, will most likely fail if the format changes (e.g., pretty-printed, keys of an object sorted).
-
@Terry-R said in Proposed FAQ explaining why you shouldn't parse JSON with regex:
But what I did find was a number of apps (modules) that have already been created and they are designed to work across many JSON structures. Unfortunately it will still require those who wish to use them a learning curve so they can utilise these apps. Possibly that learning curve is still flatter than learning a new programming language.
Yes, there seems to be a no-win situation here for this sort of problem:
- I’m working on something that we planned to deploy quickly, but I’m blocked because I need to change this JSON file.
- The change is conceptually simple (at least, it appears so in my mind), but it happens in dozens of places throughout the file, so doing it manually would be tedious, time-consuming and error-prone.
- I can’t even remember the last time I had to do anything much with a JSON file, and I don’t expect to run into this again anytime soon.
- Did I mention we want this sooner rather than later?
For this person, taking time — a few days? a week? more than that? — to learn a new programming language and then debug a script written in that language is absurd. Getting hung up on a long detour for something that is conceptually simple (“I just need to change all the device names of smart light bulbs from xxx to bulb-xxx!”) would be unreasonable. (Even if you were willing to take the time to learn for future use, you still have this problem that needs a solution now.)
It could be that the RemesPath language in JsonTools is quick enough to grasp that it’s practical to learn that even for a one-off problem. (I confess I haven’t yet attempted to understand it.) But I suspect most people in this situation would just hack away with a regex and hope for the best.
It’s too bad there isn’t a low-learning-curve solution for applying regex-style logic to structured data like JSON, XML, HTML, etc.; but based on these discussions, I’m led to believe it doesn’t exist.
-
@Coises said in Proposed FAQ explaining why you shouldn't parse JSON with regex:
It could be that the RemesPath language in JsonTools is quick enough to grasp that it’s practical to learn that even for a one-off problem.
Well… sort of. RemesPath probably could be a lot easier to grasp, and I think it is currently held back by confusing error messages.
JsonTools has a find/replace form, which essentially translates questions along the lines of
I want to find all keys in this JSON that mach the regular expression `.*name`
into RemesPath queries.This essentially aims to provide training wheels for learning RemesPath, similar to how the GUI in Microsoft Power Query shows the user the M-script it emits.
I guess an example would be, suppose the user wants to find all the values that start with a word containing the substring
AG
(case-insensitive) in thename
field of any object.The workflow with the find/replace form might go as follows:
- the user wants to recursively search for all keys that match
name
exactly, so they:- check the
Show advanced options->Match exactly?
box - leaves
Recursive search?
checked. - leaves
Ignore case?
unchecked - select the
Keys
option from theSearch in keys or values?
dropdown - enter
name
in theFind...
box.
- check the
- The query emitted by the form says
(@)..`name`
, so:- the user looks at the treeview (which will snap to the location of each search result) to see if it found the right values.
- the user copies
..`name`
into theRoot
field of the find/replace form - puts
^\w*AG
into theFind...
field - changes the
Search in keys or values?
dropdown toValues
- checks
Show advanced options->Ignore case?
- unchecks
Show advanced options->Match exactly?
- checks *
Show advanced options->Use regular expressions?
and this emits the query
(@..`name`)..*[str(@) =~ g`(?i)^\w*AG`]
, which will find the desired things. - the user wants to recursively search for all keys that match
-
@Mark-Olson I prefer this quote from 12 Aug 1997:
http://regex.info/blog/2006-09-15/247
Some people, when confronted with a problem, think
“I know, I’ll use regular expressions.” Now they have two problems.