FAQ: Parsing and Editing JSON with regex is a bad idea

Mark Olson

You are likely reading this FAQ because you are trying to use regular expressions to work with JSON, and someone wants to explain why this is a bad idea to you. And it is a very bad idea to parse JSON with regular expressions, as you may have already discovered at the cost of hours of frustration and dead-ends. A normal programming language (e.g., Python, JavaScript) is the proper way to work with JSON.

TL;DR: OK, I believe you. I shouldn’t use regex to search in or edit JSON. What can I use instead?

Python

Overview: Python is an easy-to-learn language, and its json module makes it easy to work with JSON.
Plugins: In addition, the PythonScript plugin allows you to run Python scripts inside of Notepad++.
For example, here is how to parse the current document as JSON, and pretty-print it:

# Npp gives us access to the editor component, which manages
#     the text, selections, and styling of the current document
from Npp import editor
# json is the Python standard library module for working with json
import json

# get the text of the current document
text = editor.getText()
# try to parse the current document as json (you might get an error here)
js = json.loads(text)
# display the json in the PythonScript console
print(js)
# set the text of the current file to the pretty-printed version of the JSON
editor.setText(json.dumps(js, indent=4))

JavaScript
- Overview: JavaScript is another very popular language, but I don’t know as much about it TBH. The JSON.parse and JSON.stringify functions let you work with JSON.
- Plugins: jN (JavaScript for Notepad++) is another plugin, but I know essentially nothing about it
JsonTools
- Overview: this is a plugin that I wrote. It includes a bunch of stuff, including:
  - a scripting tool for editing and viewing JSON
  - commands for pretty-printing and compressing JSON
  - a parser that can handle JSON with comments and many kinds of errors
- I recommend you read the documentation if you are interested.

Please remember that while questions about JsonTools are appropriate for this forum (because it is a Notepad++ plugin), questions about Python or JavaScript should be asked elsewhere (e.g. StackOverflow) unless they are about the PythonScript or jN plugins specifically.

A common objection

Before I begin the FAQ, I should probably dispense with perhaps the most common objection to using anything other than regex to parse JSON.

But I’m not a programmer, and I don’t want to become one!

Fair enough. But whether you like it or not, JSON was designed by programmers for programmers, and if you’re working with JSON, it behooves you to learn to think like a programmer.

Python and JavaScript are both easier than you might think, and both languages are excellent tools for working with JSON. You don’t need to become an expert to get a lot of benefit from knowing programming.

But I’m not smart enough to become a programmer!

Oh really? Look at this regular expression below to get a feel for what a relatively simple (not to mention easily broken) regular expression for parsing JSON looks like.

(?-i)(?:(?<group1>(?-s).+)(?<group2>(?s).+?-{10}.+?"name" : )"[^"]+")|(?:\R+-{10}\R)

If you are smart enough to understand the above regular expression, you are almost certainly smart enough to understand simple programs in Python or JavaScript.

I personally took much longer to get comfortable enough with regular expressions to understand that regex than I took to get comfortable enough with Python to write a program that could solve the same problem that the regex above was trying to solve. If you can understand this regex and modify it to your needs after only a few hours of reading, congratulations, you’re likely smarter than Mark Olson, the writer of this FAQ.

Even though I’m smart enough to learn how to program, I (don’t want to/don’t have time to/have some other reason)

Fair enough. You can still stick around for the rest of the FAQ if you want, but it’s not really directed at you.

Some terminology:

[1, 2, 3] is an array.

the numbers in this array are interchangeably referred to as its elements or its children.

{"foo": 1, "bar": 2} is an object.

foo and bar are the keys and 1 and 2 are the values.
For simplicity, the child of key key in an object is also referred to as the key field of that object.

For other terminology, see the official JSON specification.

Beginning of the reasons why you shouldn’t parse JSON with regex

For the purposes of this tutorial/rant/discussion, I will be working with this simple JSON document.

[
	{
		"department": 1,
		"name": "foo",
		"employees": [
			{"age": 23.5, "name": "Bob", "favorite_ascii_chars": "A{\""},
			{"age": 41.75, "favorite_ascii_chars": "Q]\\", "name": "Amy"}
		]
	},
	{
		"department": 2,
		"employees": [{"name": "Mike", "age": 46.0, "favorite_ascii_chars": "}*:"}],
		"name": "bar"
	},
	{
		"employees": [
			{"age": 25.0, "favorite_ascii_chars": "\\h'", "name": "Alex"},
			{"age": 24.0, "favorite_ascii_chars": "b\\\"", "name": "Sam"}
		],
		"department": 3,
		"name": "baz"
	}
]

and the question we want to answer is what are the names and departments of the employees whose names begin with a letter before M in the alphabet?

By the way, this is not a challenge for the regex gurus out there. If you really want to try to solve this problem using regex, then fine. But I am explicitly urging you not to do this.

All whitespace (except inside strings) is insignificant

Self-explanatory. But it means that your regex needs to be littered with \s* to deal with this fact.

The order of keys in a JSON object is not syntactically important, but regular expressions care quite a lot.

You may notice that this document’s employee objects (children of the employees field of the department objects) all have three keys:

age, a number.
name, a string.
favorite_ascii_chars, a string.

You may also notice that these two employee objects have the keys in different orders
{"age": 23.5, "name": "Bob", "favorite_ascii_chars": "A{\""}
{"age": 41.75, "favorite_ascii_chars": "Q]\\", "name": "Amy"}

Pretty much every JSON parser under the sun does not care about the order of these keys, so your regular expression needs to work whether the key you care about is in the beginning, middle, or end of the object.

JSON strings are trickier to parse with regular expressions than you think

I speak from experience.

Your first instinct is probably to do something like "[^"]*" but this is wrong, because you need to consider literal quote characters (which are escaped with \), and it also forgets that you can’t include CR or LF in strings.

So you try "(?:\\"|[^"\r\n])*" but this is also wrong, because it doesn’t correctly reflect the other things that can be escaped, like \t and \n.

Long story short, the only regular expression I think is reasonably close to correct is (?-si)"(?:\\(?:[nrbtf/"\\]|u[\da-fA-F]{4})|[^\\"\r\n])*" and while this regex could be simplified (at the cost of accepting some non-compliant strings) to "(?:\\[^\r\n]|[^\r\n])*?", it’s a nuisance to have to copy-paste that regex again and again into your master regex that actually does the work for you.

If you try to use your regex on a file with a JSON syntax error, your results will likely be garbage

Even if you remember every rule discussed above on how to parse JSON, only a JSON parser can determine if a file is valid JSON, and it’s very easy for a regex that works correctly on valid JSON to exhibit inappropriate behavior on invalid JSON.

For example, when you use the JSON-string-finding regex "(?:\\[^\r\n]|[^\r\n])*?" to find strings in the invalid JSON file, {"foo": "bar\", "baz": "quz"}, the regex will happily inform you that the file contains the strings ("foo", "bar\", "), and ": "), when the correct answer is that the file does not contain any JSON strings, because it is not valid JSON.

You will forget at least one of the above rules at some point, and it will come back to bite you.

If you perfectly internalize every rule you ever learn, and you never forget to apply them where it’s appropriate, then congratulations, you are a much better programmer than I am.

Even before considering the corner cases discussed above, JSON-parsing regular expressions tend to be ugly as sin

https://xkcd.com/1171/

I had 99 problems. Then I used regular expressions. Now I have 100 problems.

Even if you could parse JSON with regular expressions, it is (usually) more efficient to use a JSON parser (warning: technical details ahead)

JSON parsers only need to read a JSON file once to correctly parse it, and while they read some characters multiple times, the amount of time required to parse a file should exhibit linear growth.

By contrast, regular expressions have the potential to read the same text many times, and sometimes the number of times the regex engine reads the document is proportional to the size of the document. This can lead to very bad performance on large documents.

Once a JSON file is parsed, most languages store objects using a data structure called a hashmap, which allows retrieval of the value associated with a key in an amount of time that usually does not depend on the number of keys in the object.

Similarly, parsed arrays allow for random access to elements in an amount of time that does not depend on where the element is in the array. If you do enough random lookups, the time savings can be significant. By contrast, a regular expression needs to scan starting at the top of the array every time it wants to find an element.

Spoiler: how to solve the above problem using PythonScript or JsonTools

One way to answer this problem using JSON is as follows:

[
	{"dept": 1, "emp_names": ["Bob", "Amy"]},
	{"dept": 2, "emp_names": []},
	{"dept": 3, "emp_names": ["Alex"]}
]

Essentially, we get an array of objects where the dept field is a department number and the emp_names field is a list of names of employees in that department whose names begin with a letter before M in the alphabet.

Solution using PythonScript

save the following script in the %AppData%\Roaming\Notepad++\plugins\config\PythonScript\scripts folder. Call it dont_parse_json_with_regex_FAQ.py.

# get access to the editor (text of current doc) and notepad (file manager)
from Npp import editor, notepad
# get the standard Python JSON library (https://docs.python.org/3/library/json.html)
import json
# get the standard Python regex library (https://docs.python.org/3/library/re.html)
import re

# TODO (not shown here because it is not Notepad++ - specific)
# 1. read the documentation for json and re
# 2. figure out how to solve this problem
# 3. Save the answer to a variable named dept_empnames

# create a new file
notepad.new()
# dump the pretty-printed json in the new file
new_json_str = json.dumps(dept_empnames, indent=4)
editor.setText(new_json_str)

open the file containing the JSON example above.
Click on Plugins->PythonScript->scripts->dont_parse_json_with_regex_FAQ
A new file will open containing the JSON answer shown above (formatted a bit differently)

Solution using JsonTools

go to Plugins->JsonTools->Open JSON tree viewer and a tree view will appear for the document. Explore it to get a feel for how it works.
Enter the query @[:]{dept: @.department, emp_names: @.employees[:].name[@ =~ `^(?i)[A-L]`]} in the text box above the tree view.
1. See the RemesPath docs for information about the syntax of this query.
2. The key thing to understand about RemesPath is the way the @ symbol represents the “current JSON”.
3. Initially, the “current JSON” is the document.
4. However, inside of {dept: @.department, emp_names: @.employees...}, @ instead refers to each department object in the original JSON document., because it comes after @[:], which iterates through an array.
5. Next we get the employees field (@.employees) of the current JSON (which is now a department object)
6. Next we want to get the name field of each employee object in the current JSON (which is now an array of employee objects). This is represented by appending [:].name to the subquery of step 5.
7. Next we want to test if each employee name (which is now the current JSON) starts with a letter from A-L (case-insensitive). This is done by appending [@ =~ `^(?i)[A-L]`] to the subquery of steps 5-6.
8. So, to combine steps 5-7, if we want to get the name field (but only if it starts with a letter from A-L) of each employee object in the employees field of each department object, we have the subquery @.employees[:].name[@ =~ `^(?i)[A-L]`]
9. The {dept: <something1>, emp_names: <something2>} syntax creates an object that maps dept to something1 (which could be a function of input, a number, or whatever) and emp_names to something2 (which could also be anything)
10. Putting it all together, the query we use to get the answer we want is @[:]{dept: @.department, emp_names: @.employees[:].name[@ =~ `^(?i)[A-L]`]}
Click the Save query result button above the tree view.
A new document should appear containing the JSON answer above.

Summary of the different approaches

As you can see, JsonTools and PythonScript offer two very different approaches to this problem.

PythonScript is an extremely flexible general-purpose tool, and as such it takes a bit more setup to solve this problem with it. However, you can go further in the long run with knowledge of Python.
JsonTools is a domain-specific and (hopefully) user-friendly tool, and JSON queries written in JsonTools tend to require a lot fewer lines of code than the same query written in Python. However, RemesPath is much more likely to have bugs than Python (please let Mark Olson know if you find one), and its error messages may sometimes be cryptic.

PeterJones

Corollary: Parsing and Editing XML or HTML with regex is also a bad idea

For essentially the same reasons.

More on the horrors of HTML and regex can be found in this Stack Overflow answer.

For working with XML and similar markups in Notepad++, the plugin XMLTools is an excellent choice as the first line of attack, and using one of the scripting plugins (like PythonScript, LuaScript, or jN) and the XML libraries that are appropriate to that programming language will allow you to tackle the more complicated problems.