FAQ: Parsing and Editing JSON with regex is a bad idea
-
You are likely reading this FAQ because you are trying to use regular expressions to work with JSON, and someone wants to explain why this is a bad idea to you. And it is a very bad idea to parse JSON with regular expressions, as you may have already discovered at the cost of hours of frustration and dead-ends. A normal programming language (e.g., Python, JavaScript) is the proper way to work with JSON.
TL;DR: OK, I believe you. I shouldn’t use regex to search in or edit JSON. What can I use instead?
- Python
- Overview: Python is an easy-to-learn language, and its json module makes it easy to work with JSON.
- Plugins: In addition, the PythonScript plugin allows you to run Python scripts inside of Notepad++.
- For example, here is how to parse the current document as JSON, and pretty-print it:
# Npp gives us access to the editor component, which manages # the text, selections, and styling of the current document from Npp import editor # json is the Python standard library module for working with json import json # get the text of the current document text = editor.getText() # try to parse the current document as json (you might get an error here) js = json.loads(text) # display the json in the PythonScript console print(js) # set the text of the current file to the pretty-printed version of the JSON editor.setText(json.dumps(js, indent=4))
- JavaScript
- Overview: JavaScript is another very popular language, but I don’t know as much about it TBH. The
JSON.parse
andJSON.stringify
functions let you work with JSON. - Plugins: jN (JavaScript for Notepad++) is another plugin, but I know essentially nothing about it
- Overview: JavaScript is another very popular language, but I don’t know as much about it TBH. The
- JsonTools
- Overview: this is a plugin that I wrote. It includes a bunch of stuff, including:
- a scripting tool for editing and viewing JSON
- commands for pretty-printing and compressing JSON
- a parser that can handle JSON with comments and many kinds of errors
- I recommend you read the documentation if you are interested.
- Overview: this is a plugin that I wrote. It includes a bunch of stuff, including:
Please remember that while questions about JsonTools are appropriate for this forum (because it is a Notepad++ plugin), questions about Python or JavaScript should be asked elsewhere (e.g. StackOverflow) unless they are about the PythonScript or jN plugins specifically.
A common objection
Before I begin the FAQ, I should probably dispense with perhaps the most common objection to using anything other than regex to parse JSON.
But I’m not a programmer, and I don’t want to become one!
Fair enough. But whether you like it or not, JSON was designed by programmers for programmers, and if you’re working with JSON, it behooves you to learn to think like a programmer.
Python and JavaScript are both easier than you might think, and both languages are excellent tools for working with JSON. You don’t need to become an expert to get a lot of benefit from knowing programming.
But I’m not smart enough to become a programmer!
Oh really? Look at this regular expression below to get a feel for what a relatively simple (not to mention easily broken) regular expression for parsing JSON looks like.
(?-i)(?:(?<group1>(?-s).+)(?<group2>(?s).+?-{10}.+?"name" : )"[^"]+")|(?:\R+-{10}\R)
If you are smart enough to understand the above regular expression, you are almost certainly smart enough to understand simple programs in Python or JavaScript.
I personally took much longer to get comfortable enough with regular expressions to understand that regex than I took to get comfortable enough with Python to write a program that could solve the same problem that the regex above was trying to solve. If you can understand this regex and modify it to your needs after only a few hours of reading, congratulations, you’re likely smarter than Mark Olson, the writer of this FAQ.
Even though I’m smart enough to learn how to program, I (don’t want to/don’t have time to/have some other reason)
Fair enough. You can still stick around for the rest of the FAQ if you want, but it’s not really directed at you.
Some terminology:
[1, 2, 3]
is an array.- the numbers in this array are interchangeably referred to as its elements or its children.
{"foo": 1, "bar": 2}
is an object.foo
andbar
are the keys and1
and2
are the values.- For simplicity, the child of key
key
in an object is also referred to as thekey
field of that object.
For other terminology, see the official JSON specification.
Beginning of the reasons why you shouldn’t parse JSON with regex
For the purposes of this tutorial/rant/discussion, I will be working with this simple JSON document.
[ { "department": 1, "name": "foo", "employees": [ {"age": 23.5, "name": "Bob", "favorite_ascii_chars": "A{\""}, {"age": 41.75, "favorite_ascii_chars": "Q]\\", "name": "Amy"} ] }, { "department": 2, "employees": [{"name": "Mike", "age": 46.0, "favorite_ascii_chars": "}*:"}], "name": "bar" }, { "employees": [ {"age": 25.0, "favorite_ascii_chars": "\\h'", "name": "Alex"}, {"age": 24.0, "favorite_ascii_chars": "b\\\"", "name": "Sam"} ], "department": 3, "name": "baz" } ]
and the question we want to answer is what are the names and departments of the employees whose names begin with a letter before
M
in the alphabet?By the way, this is not a challenge for the regex gurus out there. If you really want to try to solve this problem using regex, then fine. But I am explicitly urging you not to do this.
All whitespace (except inside strings) is insignificant
Self-explanatory. But it means that your regex needs to be littered with
\s*
to deal with this fact.The order of keys in a JSON object is not syntactically important, but regular expressions care quite a lot.
You may notice that this document’s
employee
objects (children of theemployees
field of thedepartment
objects) all have three keys:age
, a number.name
, a string.favorite_ascii_chars
, a string.
You may also notice that these two employee objects have the keys in different orders
{"age": 23.5, "name": "Bob", "favorite_ascii_chars": "A{\""}
{"age": 41.75, "favorite_ascii_chars": "Q]\\", "name": "Amy"}
Pretty much every JSON parser under the sun does not care about the order of these keys, so your regular expression needs to work whether the key you care about is in the beginning, middle, or end of the object.
JSON strings are trickier to parse with regular expressions than you think
I speak from experience.
Your first instinct is probably to do something like
"[^"]*"
but this is wrong, because you need to consider literal quote characters (which are escaped with\
), and it also forgets that you can’t include CR or LF in strings.So you try
"(?:\\"|[^"\r\n])*"
but this is also wrong, because it doesn’t correctly reflect the other things that can be escaped, like\t
and\n
.Long story short, the only regular expression I think is reasonably close to correct is
(?-si)"(?:\\(?:[nrbtf/"\\]|u[\da-fA-F]{4})|[^\\"\r\n])*"
and while this regex could be simplified (at the cost of accepting some non-compliant strings) to"(?:\\[^\r\n]|[^\r\n])*?"
, it’s a nuisance to have to copy-paste that regex again and again into your master regex that actually does the work for you.If you try to use your regex on a file with a JSON syntax error, your results will likely be garbage
Even if you remember every rule discussed above on how to parse JSON, only a JSON parser can determine if a file is valid JSON, and it’s very easy for a regex that works correctly on valid JSON to exhibit inappropriate behavior on invalid JSON.
For example, when you use the JSON-string-finding regex
"(?:\\[^\r\n]|[^\r\n])*?"
to find strings in the invalid JSON file,{"foo": "bar\", "baz": "quz"}
, the regex will happily inform you that the file contains the strings ("foo"
,"bar\", "
), and": "
), when the correct answer is that the file does not contain any JSON strings, because it is not valid JSON.You will forget at least one of the above rules at some point, and it will come back to bite you.
If you perfectly internalize every rule you ever learn, and you never forget to apply them where it’s appropriate, then congratulations, you are a much better programmer than I am.
Even before considering the corner cases discussed above, JSON-parsing regular expressions tend to be ugly as sin
Even if you could parse JSON with regular expressions, it is (usually) more efficient to use a JSON parser (warning: technical details ahead)
JSON parsers only need to read a JSON file once to correctly parse it, and while they read some characters multiple times, the amount of time required to parse a file should exhibit linear growth.
By contrast, regular expressions have the potential to read the same text many times, and sometimes the number of times the regex engine reads the document is proportional to the size of the document. This can lead to very bad performance on large documents.
Once a JSON file is parsed, most languages store objects using a data structure called a hashmap, which allows retrieval of the value associated with a key in an amount of time that usually does not depend on the number of keys in the object.
Similarly, parsed arrays allow for random access to elements in an amount of time that does not depend on where the element is in the array. If you do enough random lookups, the time savings can be significant. By contrast, a regular expression needs to scan starting at the top of the array every time it wants to find an element.
Spoiler: how to solve the above problem using PythonScript or JsonTools
One way to answer this problem using JSON is as follows:
[ {"dept": 1, "emp_names": ["Bob", "Amy"]}, {"dept": 2, "emp_names": []}, {"dept": 3, "emp_names": ["Alex"]} ]
Essentially, we get an array of objects where the
dept
field is a department number and theemp_names
field is a list of names of employees in that department whose names begin with a letter beforeM
in the alphabet.Solution using PythonScript
- save the following script in the
%AppData%\Roaming\Notepad++\plugins\config\PythonScript\scripts
folder. Call itdont_parse_json_with_regex_FAQ.py
.
# get access to the editor (text of current doc) and notepad (file manager) from Npp import editor, notepad # get the standard Python JSON library (https://docs.python.org/3/library/json.html) import json # get the standard Python regex library (https://docs.python.org/3/library/re.html) import re # TODO (not shown here because it is not Notepad++ - specific) # 1. read the documentation for json and re # 2. figure out how to solve this problem # 3. Save the answer to a variable named dept_empnames # create a new file notepad.new() # dump the pretty-printed json in the new file new_json_str = json.dumps(dept_empnames, indent=4) editor.setText(new_json_str)
- open the file containing the JSON example above.
- Click on
Plugins->PythonScript->scripts->dont_parse_json_with_regex_FAQ
- A new file will open containing the JSON answer shown above (formatted a bit differently)
Solution using JsonTools
- go to
Plugins->JsonTools->Open JSON tree viewer
and a tree view will appear for the document. Explore it to get a feel for how it works. - Enter the query
@[:]{dept: @.department, emp_names: @.employees[:].name[@ =~ `^(?i)[A-L]`]}
in the text box above the tree view.- See the RemesPath docs for information about the syntax of this query.
- The key thing to understand about RemesPath is the way the
@
symbol represents the “current JSON”. - Initially, the “current JSON” is the document.
- However, inside of
{dept: @.department, emp_names: @.employees...}
,@
instead refers to each department object in the original JSON document., because it comes after@[:]
, which iterates through an array. - Next we get the
employees
field (@.employees
) of the current JSON (which is now a department object) - Next we want to get the
name
field of each employee object in the current JSON (which is now an array of employee objects). This is represented by appending[:].name
to the subquery of step 5. - Next we want to test if each employee name (which is now the current JSON) starts with a letter from A-L (case-insensitive). This is done by appending
[@ =~ `^(?i)[A-L]`]
to the subquery of steps 5-6. - So, to combine steps 5-7, if we want to get the
name
field (but only if it starts with a letter from A-L) of each employee object in theemployees
field of each department object, we have the subquery@.employees[:].name[@ =~ `^(?i)[A-L]`]
- The
{dept: <something1>, emp_names: <something2>}
syntax creates an object that mapsdept
tosomething1
(which could be a function of input, a number, or whatever) andemp_names
tosomething2
(which could also be anything) - Putting it all together, the query we use to get the answer we want is
@[:]{dept: @.department, emp_names: @.employees[:].name[@ =~ `^(?i)[A-L]`]}
- Click the
Save query result
button above the tree view. - A new document should appear containing the JSON answer above.
Summary of the different approaches
As you can see, JsonTools and PythonScript offer two very different approaches to this problem.
- PythonScript is an extremely flexible general-purpose tool, and as such it takes a bit more setup to solve this problem with it. However, you can go further in the long run with knowledge of Python.
- JsonTools is a domain-specific and (hopefully) user-friendly tool, and JSON queries written in JsonTools tend to require a lot fewer lines of code than the same query written in Python. However, RemesPath is much more likely to have bugs than Python (please let Mark Olson know if you find one), and its error messages may sometimes be cryptic.
- Python
-
Corollary: Parsing and Editing XML or HTML with regex is also a bad idea
For essentially the same reasons.
More on the horrors of HTML and regex can be found in this Stack Overflow answer.
For working with XML and similar markups in Notepad++, the plugin XMLTools is an excellent choice as the first line of attack, and using one of the scripting plugins (like PythonScript, LuaScript, or jN) and the XML libraries that are appropriate to that programming language will allow you to tackle the more complicated problems.
-
-
-
-
-
-
-