Brim Data

Introducing zq

An Easier (and Faster) Alternative to jq
Author Steven McCanne

If you’ve ever had the need to query or transform large amounts of JSON data, you’ve undoubtedly turned to the phenomenal and ubiquitous tool jq.

Yet your first experience trying out jq was probably a bit challenging.

Its syntax is powerful but also terse and arguably a little too clever. It’s kind of like a Unix pipeline but not really.

Don’t get me wrong. jq is awesomely useful. But overall, it just feels a bit too hard.

Deciphering jq

So, how about we bite the bullet and just RTFM?

Okay, fine. But if you can remember the first time you skimmed the jq manual, it probably felt like you needed to already know how jq worked in order to understand what the jq manual was saying.

“Everything is a filter” doesn’t quite do it for me.

Any Good Tutorials?

We figured someone out there must have already cracked the code and explained it all. And sure enough, there’s a ton of great content in various places on jq.

If you poke around and search Hacker News you’ll quickly find Adam Gordon Bell’s masterfully written article An Introduction to JQ. As Bell points out, the solution to your jq programming problem is usually just a Google search away.

If you’re more of a visual learner, Szymon Stepniak has some excellent introductory videos. Just search for “szymon jq” on YouTube.

Learning the Terse Syntax

These tutorials do a great job explaining jq’s terse syntax and guiding you through a number of useful query patterns.

They explain what . is. And how the selector-style syntax .a.b.c can be used to dereference a path through a nested JSON object. And how . isn’t just the dereference operator but also refers to the top-level value, e.g., when . is an array you can index it with .[1] and slice it with .[1:2].

They also show how [] is a postfix operator that iterates over the elements of an array or the values of an object:

$ echo '{"array":[1,2]}' | jq '.array[]'
1
2
$ echo '{"object":{"a":1,"b":2}}' | jq '.object[]'
1
2

They explain how you can construct arrays a little bit like JavaScript:

$ echo '1 2' | jq '[0,.]'
[
  0,
  1
]
[
  0,
  2
]

and how to construct objects, also like JavaScript:

$ echo '1 2' | jq '{x:.,y:(2*.)}'
{
  "x": 1,
  "y": 2
}
{
  "x": 2,
  "y": 4
}

But the model…?

After you put in a little time learning this ingeniously terse syntax, jq gets quite a bit easier… as long as your problem matches a cookie-cutter pattern that you can find with Google or copy from a cheat sheet.

But when you can’t find the pattern you’re looking for, jq is still really hard.

For a long time, I thought it was the syntax that was hard. But after learning the syntax and studying the jq manual quite a bit, I realized it was still hard and that the difficulty must lie somewhere else.

Stateless Dataflow

After thinking about it, I realized I didn’t understand at all what jq was really doing under the covers. It’s one thing to see an example and blindly copy a code pattern to get the desired result. It’s another to really understand the underlying computational model.

After studying the manual and the tutorials and running copious experiments with jq’s debug filter, I think I’ve learned enough to describe jq’s computational model as: stateless dataflow.

The jq model embodies dataflow in the sense that you can feed jq a sequence of values, have it operate on those values, and emit the sequence of results, e.g.,

$ echo '1 2 3' | jq .+1
2
3
4

but it’s stateless in the sense that you cannot combine those values in a meaningful way; in other words, you cannot carry state across the sequence of values.
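To make the distinction concrete, here is a minimal Python sketch of this model. It is an analogy only, not how jq is implemented, and the helper name run_filter is made up for illustration. It also simplifies by assuming the filter emits exactly one output per input.

```python
def run_filter(filter_fn, values):
    """Apply filter_fn to each input value on its own, roughly as jq does per input."""
    for value in values:
        # Each call sees only the current value; nothing survives between calls.
        yield filter_fn(value)


# Analogous to: echo '1 2 3' | jq .+1
results = list(run_filter(lambda v: v + 1, [1, 2, 3]))
print(results)  # [2, 3, 4]
```

Because run_filter never hands one value’s result to the next call, there is no place for a running total to live.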

Curiously, if you try to add up the input with an intuitive move like this, you get some pretty cryptic error messages:

$ echo '1 2 3' | jq add
jq: error (at <stdin>:1): Cannot iterate over number (1)
jq: error (at <stdin>:1): Cannot iterate over number (2)
jq: error (at <stdin>:1): Cannot iterate over number (3)

Slurp It

A jq pro would say “Of course that doesn’t work. The add filter takes an array and you need to slurp the input into an array.”

Okay, let’s try out that pattern by specifying -s to slurp. Sure enough this gives the correct answer:

$ echo '1 2 3' | jq -s add
6

But this doesn’t actually make the computation stateful, right? Instead, it simply transforms the input from a sequence of discrete values (numbers) into a single composite value (an array of numbers).

Indeed, this slurping command mode simply transforms the three input values into a single array value as can be seen here:

$ echo '1 2 3' | jq -s .
[
  1,
  2,
  3
]

In other words, the slurped form is equivalent to this non-slurped form operating on an array:

echo '[1,2,3]' | jq add

So, it appears you can emulate stateful dataflow by wrapping each filter’s result in a single but arbitrarily large array value.
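In Python terms, slurping can be pictured as follows. This is a rough model with hypothetical helper names (slurp, jq_add), not jq’s actual code: the sequence of inputs becomes a sequence holding one list, and the add step only works on lists.

```python
def slurp(values):
    """Collect the whole input sequence into a single list value, like jq -s."""
    return [list(values)]  # a sequence with exactly one (possibly huge) value


def jq_add(value):
    """Toy stand-in for jq's add: it needs an array to iterate over."""
    if not isinstance(value, list):
        raise TypeError(f"Cannot iterate over number ({value})")
    return sum(value)


inputs = [1, 2, 3]
slurped = slurp(inputs)                 # [[1, 2, 3]]
totals = [jq_add(v) for v in slurped]   # the filter runs once, on the one array
print(totals)  # [6]
```

Calling jq_add(1) directly raises an error, mirroring the un-slurped jq add run shown earlier.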

Emulating Stateful Dataflow

Okay, now it’s clear: jq is hard because you often want to do stateful things with a bunch of computed values but the dataflow model is stateless.

It’s not the syntax that’s hard to learn; it’s the computational model.

To work around this statelessness, you can wrap a sequence of independent values into an array, iterate over the array, then wrap that result back up into another array so you can pass the entire sequence as a single value downstream to the “next filter”.

To make this all concrete, suppose the input is a sequence of arrays of numbers and the problem is to add up all the numbers across all the arrays.

Your first stab at this might be:

$ echo '[1,2,3] [4,5,6]' | jq add
6
15

But because the model is stateless, you get a separate answer for each input array, which isn’t our desired result.

Okay, no problem: just wrap those results into an array [6,15] and apply add to that array. We get:

$ echo '[1,2,3] [4,5,6]' | jq '[add] | add'
6
15

What the heck? That didn’t work. Let’s debug this by taking off the final add:

$ echo '[1,2,3] [4,5,6]' | jq '[add]'
[
  6
]
[
  15
]

Hmm, we get two singleton arrays.

But of course! This is the stateless model at work.

jq is running the entire computation for each input value separately, and we are wrapping each scalar result in its own array.

To remedy this, let’s slurp the input into an array, giving us an array of arrays, and then iterate over each inner array, adding up its elements.

As a first step, this gives us the single array we desired:

$ echo '[1,2,3] [4,5,6]' | jq -s '[.[] | add]'
[
  6,
  15
]

And, now we can apply the add filter to this output, arriving at the desired result:

$ echo '[1,2,3] [4,5,6]' | jq -s '[.[] | add] | add'
21

Who knew adding up numbers could be so hard?
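The two attempts above can be mimicked in Python to show where they diverge. This is a sketch of the model only, with made-up function names: the first program runs once per input, the second runs once on the slurped array of arrays.

```python
inputs = [[1, 2, 3], [4, 5, 6]]


def wrap_then_add(array):
    """Model of '[add] | add' applied to ONE input array."""
    wrapped = [sum(array)]   # [add] wraps this input's sum in its own array
    return sum(wrapped)      # | add just unwraps it again


# Without slurp, the whole program runs separately for every input.
per_input = [wrap_then_add(a) for a in inputs]
print(per_input)  # [6, 15]


def slurped_program(all_inputs):
    """Model of -s '[.[] | add] | add' applied to the single slurped value."""
    partial_sums = [sum(a) for a in all_inputs]  # [.[] | add]
    return sum(partial_sums)                     # | add


print(slurped_program(inputs))  # 21
```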

Scaling Slurp

Okay, we can use slurp to emulate stateful dataflow but it has a problem: jq must read all of the input into memory to construct the single input array value.

Computers these days have a lot of memory and for many practical use cases, this memory consumption is just fine, but when you want to apply stateful operations to large collections of JSON, e.g., stored as newline-delimited JSON, this approach can become problematic.

The good news is jq has yet another pattern to work around this. This little gem solves our problem from above in a different way:

echo '[1,2,3] [4,5,6]' | jq -n 'reduce inputs as $array (0; .+($array|add))'

which streams the input into the query and gives the correct answer of 21.

What’s going on here?!

We’ll leave it as an exercise for the reader to work out why this query works and how the reduce filter and the inputs filter (which incidentally requires the -n option here) work together to solve our problem.
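For readers who want a hint, one way to read the query is as a fold over the inputs: start an accumulator at 0 and, for each array that arrives, add that array’s sum to it. The Python sketch below follows that reading. The helper iter_json_values is hypothetical and only imitates jq’s handling of whitespace-separated JSON values; it parses from a string that is already in memory, so it illustrates the control flow rather than true streaming.

```python
import json
from functools import reduce


def iter_json_values(text):
    """Yield whitespace-separated JSON values from text, one at a time."""
    decoder = json.JSONDecoder()
    pos = 0
    while True:
        # raw_decode does not skip leading whitespace, so skip it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            return
        value, pos = decoder.raw_decode(text, pos)
        yield value


# Roughly: jq -n 'reduce inputs as $array (0; .+($array|add))'
total = reduce(
    lambda acc, array: acc + sum(array),   # .+($array|add)
    iter_json_values('[1,2,3] [4,5,6]'),   # inputs
    0,                                     # initial accumulator
)
print(total)  # 21
```

Only the accumulator and the current array are needed at any moment, which is why this pattern avoids building one giant array.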

But here’s the key point: this is still confusing and hard!

If you want to solve stateful problems, you have to mix and match the right command-line options with pretty confusing jq filters to emulate stateful dataflow.

Could there be a better way?

Enter zq

In the Zed project, we’ve developed a command-line tool called zq (and, yes, the name “zq” is a play on “jq”) that is based on an orthogonal set of dataflow operators.

Zed operators are like jq filters but they can be stateful. To make this work, we simply borrowed the aggregate function concept from the 50-year-old SQL model, though unlike SQL, Zed aggregate functions can stream their results incrementally.

In short, zq is based on stateful dataflow while jq is based on stateless dataflow.
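As a rough mental picture of a stateful aggregate, consider the Python sketch below. The class RunningSum is invented for illustration and says nothing about how Zed implements aggregates; it only shows state surviving from one value to the next, with a partial result available after every value.

```python
class RunningSum:
    """A toy aggregate: keeps a total across values instead of forgetting it."""

    def __init__(self):
        self.total = 0

    def consume(self, value):
        # State carried over from every earlier value in the sequence.
        self.total += value
        return self.total  # a partial result that could be emitted right away


agg = RunningSum()
partials = [agg.consume(v) for v in [1, 2, 3]]
print(partials)   # [1, 3, 6]
print(agg.total)  # 6
```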

Installing zq

To demonstrate Zed’s stateful dataflow model, we’ll run through a number of concepts using zq. If you want to follow along here, feel free to install the zq command.

While zq supports a number of data formats, in this article, we’ll just use JSON. To this end, most of the examples here run zq with the -j flag to indicate JSON.

Stateful Dataflow

In a nutshell, stateful dataflow means you can carry state across input values. So with zq, we can easily add up a sequence of numbers without putting them in an array, e.g.,

$ echo '1 2 3' | zq -j 'sum(this)' -
{"sum":6}

Note that Zed’s this is like . in jq.

This use of sum is like the SQL aggregate function sum, though SQL has no concept of this; you must instead explicitly reference columns in tables. That is, if a column x has the values 1, 2, and 3, and you say SELECT SUM(x), you would get the same answer as above.

With zq, this SQL pattern would appear as Zed record values that are operated upon via the field x instead of top-level values:

$ echo '{"x":1}{"x":2}{"x":3}' | zq -j 'sum(x)' -
{"sum":6}

To do this with jq, you have to slurp the input objects and emulate stateful dataflow:

$ echo '{"x":1}{"x":2}{"x":3}' | jq -s '[.[].x] | add | {sum:.}'
{
  "sum": 6
}

Would you rather say sum(x) or [.[].x] | add | {sum:.}?

Are you starting to get why the stateless model of jq is difficult?

The over Operator

After working on the Zed language for a couple of years, we recently realized there was an important piece missing: the ability to flexibly navigate over complex nested values.

We looked at a number of related approaches but most of them felt pretty awkward. Various SQL approaches to JSON involve easy-enough syntax extensions to extract nested JSON, but when it comes to unrolling nested structure, the patterns seemed quite confusing: lateral flattens and joins with magic ways to reference the inter-related subquery components.

Instead, we turned to jq for inspiration. As described above, jq’s syntax can be terse and the dataflow model hard to understand, but the core mechanisms for traversing nested JSON values are brilliant.

Following jq’s lead, we adopted an iteration operator that could generate a subsequence of values from any complex value. That derived sequence could then be operated upon by an arbitrary set of Zed dataflow operators. But unlike jq, the Zed iterator streams its output continuously (like all other Zed operators) so the input sequence need not be wrapped in arrays to produce stateful results.

The over operator is a bit like jq’s [] iterator, e.g., in jq, you might say:

echo '[1,2,3]' | jq .[]

while in zq you would say

echo '[1,2,3]' | zq 'over this' -

And they both give:

1
2
3

The difference here, though, is that zq’s sequence of iterated numbers can be operated upon in stateful ways, e.g.,

$ echo '[1,2,3]' | zq -j 'over this | sum(this)' -
{"sum":6}

Yet with jq, once you iterate over the array and send each number to a downstream filter, you cannot carry state across values.

Generalizing over

The nice thing about over and zq is that any and all Zed operators can appear downstream of the over expression. In other words, any dataflow computation whatsoever can be performed upon the sequence of iterated values, not just inputs that fit into a memory-resident array.

Thus summing the numbers in the array of arrays from above can simply be expressed as:

$ echo '[1,2,3] [4,5,6]' | zq -j 'over this | sum(this)' -
{"sum":21}

Here, the over operator turns the array sequences into the continuous sequence 1,2,3,4,5,6 which is then summed up by the aggregate function.
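The same idea can be sketched in Python with a lazy flatten feeding an ordinary sum. The function name over is borrowed from Zed purely for readability; this is an analogy, not Zed’s implementation.

```python
from itertools import chain


def over(values):
    """Flatten each array in the input into one continuous, lazy sequence."""
    return chain.from_iterable(values)


arrays = [[1, 2, 3], [4, 5, 6]]
flat = list(over(arrays))
print(flat)  # [1, 2, 3, 4, 5, 6]

# The downstream aggregate consumes the sequence one number at a time.
total = sum(over(arrays))
print(total)  # 21
```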

Lateral Scopes

What if in the example above, we actually wanted to preserve the array boundaries like jq’s default behavior? This is where dataflow gets tricky: you want to have fine-grained control of how complex nested values are iterated over while also having precise access to fixed values nested above the iterated components.

Instead of always transforming [1,2,3][4,5,6] into the flattened sequence 1,2,3,4,5,6 you may want to create two subsequences 1,2,3 and 4,5,6 that are each operated upon to completion like jq.

After experimenting with several different structures for this, we landed upon the idea of “lateral scopes” that work in tandem with the over operator. To create a lateral scope you use the syntax => ( ... ) as in

$ echo '[1,2,3] [4,5,6]' | zq -j 'over this => (sum(this))' -
{"sum":6}
{"sum":15}

You can put any set of dataflow operators in the lateral scope (not just array functions like jq) and the scope’s output may be fed to any subsequent set of dataflow operators, which in turn, may also be stateful.

For example, if we wanted to compute the average of the sums, we can simply tack on a stateful avg aggregate function:

$ echo '[1,2,3] [4,5,6]' | zq -j 'over this => (sum(this)) | avg(sum)' -
{"avg":10.5}

Compare this with jq, where you have to slurp the entire input into an array and wrap partial results in intermediate arrays to emulate stateful dataflow:

$ echo '[1,2,3] [4,5,6]' | jq -s '[.[] | add] | {avg:(add/length)}'
{
  "avg": 10.5
}
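A lateral scope can be pictured in Python as a sub-computation that runs to completion on each input’s elements, with its results flowing on to a downstream stateful step. The helper over_with_scope below is hypothetical and simplified: it assumes the scope emits exactly one value per input.

```python
def over_with_scope(values, scope):
    """Run scope over the elements of each input value, one input at a time."""
    for value in values:
        yield scope(iter(value))  # the scope sees only this input's elements


arrays = [[1, 2, 3], [4, 5, 6]]

# Like: over this => (sum(this))
sums = list(over_with_scope(arrays, sum))
print(sums)  # [6, 15]

# Like: ... | avg(sum), written as a running count and total (no array needed).
count = 0
total = 0
for s in over_with_scope(arrays, sum):
    count += 1
    total += s
avg = total / count
print(avg)  # 10.5
```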

Flattening Nested Values

A common problem with managing JSON data in analytics systems is unwinding or flattening nested arrays into flattened sets of “rows”. This is where lateral scopes come in handy.

Consider this simple example JSON pattern:

{"name":"foo","vals":[1,2,3]}

and suppose we want to convert every input like this into a sequence of flattened output values that looks like this:

{"name":"foo","val":1}
{"name":"foo","val":2}
{"name":"foo","val":3}

Using a lateral scope, we can traverse the vals array, import the name value into the scope, then in that scope, refer to each value as this and the imported value as name, and thus produce a flattened output value with a record constructor from inside the scope.

The zq query looks like this:

over vals with name => ({name,val:this})

The with clause brings the name value into the lateral scope. It is needed because this in the outer scope (the whole record) is different from this in the inner scope (each iterated value).

Here is the idea in action:

$ echo '{"name":"foo","vals":[1,2,3]} {"name":"bar","vals":[4,5]}' | zq -j 'over vals with name => ({name,val:this})' -
{"name":"foo","val":1}
{"name":"foo","val":2}
{"name":"foo","val":3}
{"name":"bar","val":4}
{"name":"bar","val":5}
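For comparison, the same flattening written as a Python generator might look like the sketch below. It mirrors the structure of the query: the outer loop variable plays the role of the value imported by with, and the inner loop variable plays the role of this inside the scope. The function name flatten is just for illustration.

```python
def flatten(records):
    """Emit one flat record per element of each record's vals array."""
    for record in records:
        name = record["name"]        # the value imported by 'with name'
        for val in record["vals"]:   # each iterated value ('this' in the scope)
            yield {"name": name, "val": val}


records = [
    {"name": "foo", "vals": [1, 2, 3]},
    {"name": "bar", "vals": [4, 5]},
]
rows = list(flatten(records))
for row in rows:
    print(row)
```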

In contrast, we might use this jq query pattern to solve this flattening problem:

$ echo '{"name":"foo","vals":[1,2,3]} {"name":"bar","vals":[4,5]}' | jq 'to_entries | {name:.[0].value,val:.[1].value[]}'

So you like chocolate or vanilla?

In this case, jq’s statelessness works pretty well, but it comes down to a matter of taste.

Do you like chocolate or vanilla?

Which query would you rather write?

to_entries | {name:.[0].value,val:.[1].value[]}

or

over vals with name => ({name,val:this})

We think the latter is more intuitive. Once you learn how lateral scopes work with over, queries like this are easy to write. It’s fairly easy to commit this programming pattern to muscle memory.

On the other hand, with jq, you need to figure out that a filter like to_entries is called for, look in the manual to refresh your memory, and then you’ll probably need to run

echo '{"name":"foo","vals":[1,2,3]} {"name":"bar","vals":[4,5,6]}' | jq 'to_entries'

to look at the resulting output structure of to_entries before you can finally craft your downstream filter.

A Practical Example

The examples above are all a bit contrived, so let’s wrap up with some real-world data.

We’ll use the data and final query from Szymon Stepniak’s cookbook, which in turn uses the public OpenLibrary API.

The data for his example is easy to grab with curl. Just run:

curl -s "http://openlibrary.org/search.json?q=st%C4%99pniak" > openlibrary.json

Please have a look at Stepniak’s article if you’d like to see in detail how he reasoned through the building blocks to create this non-trivial query:

[.docs[] | {title,author_name: .author_name[0], publish_year: .publish_year[0]} | select(.author_name!=null and .publish_year!=null)] | group_by(.author_name)| [.[] | {author_name: .[0].author_name, count: . | length}] | sort_by(.count) | reverse | limit(3;.[])

This query

  • iterates over the field docs, which is an array, pulling out relevant fields,
  • filters null values out,
  • puts all these results in an array to emulate stateful dataflow,
  • performs a group-by arrangement of the array,
  • iterates over the group-by result to compute a count of the number of publications by author, and puts this result into another array, again emulating stateful dataflow,
  • sorts this array by the count field,
  • reverses the array elements,
  • then limits the output to the first three elements, giving the top three authors by publication count.
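To see the shape of the computation without the jq syntax, the steps listed above can be sketched in Python. The docs list here is made-up sample data, not the OpenLibrary response, and the helper first is hypothetical; the point is only the sequence of steps: pick fields, drop nulls, group and count, sort, take the top three.

```python
from collections import Counter

# Made-up sample records shaped like entries of the "docs" array.
docs = [
    {"title": "A", "author_name": ["Ann"], "publish_year": [2001]},
    {"title": "B", "author_name": ["Ann"], "publish_year": [2005]},
    {"title": "C", "author_name": ["Bob"], "publish_year": [1999]},
    {"title": "D", "author_name": ["Bob"]},   # no publish_year: filtered out
    {"title": "E", "publish_year": [2010]},   # no author_name: filtered out
]


def first(record, key):
    """Return the first element of record[key], or None if it is missing."""
    values = record.get(key)
    return values[0] if values else None


# Pull out the relevant fields.
rows = [
    {"title": d["title"],
     "author_name": first(d, "author_name"),
     "publish_year": first(d, "publish_year")}
    for d in docs
]
# Filter out rows with null values.
kept = [r for r in rows
        if r["author_name"] is not None and r["publish_year"] is not None]
# Group by author and count, then take the three largest counts.
counts = Counter(r["author_name"] for r in kept)
top = [{"author_name": name, "count": n} for name, n in counts.most_common(3)]
print(top)
```

Back to the jq query itself.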

Running this (as of the data obtained on April 19, 2022)…

jq '[.docs[] | {title,author_name: .author_name[0], publish_year: .publish_year[0]} | select(.author_name!=null and .publish_year!=null)] | group_by(.author_name)| [.[] | {author_name: .[0].author_name, count: . | length}] | sort_by(.count) | reverse | limit(3;.[])' openlibrary.json

gives the following results:

{
  "author_name": "S. Stepniak",
  "count": 38
}
{
  "author_name": "Władysław Stępniak",
  "count": 7
}
{
  "author_name": "Władysław Stępniak",
  "count": 4
}

This is a surprise! The answer here differs from the results in the cookbook article, and oddly here, there are two line items for the same name “Władysław Stępniak”!

We will get back to this glitch in a second, but let’s have a look at how you would write this same query in zq. Here it is:

over docs
| {title, author_name: author_name[0], publish_year: publish_year[0]}
| has(author_name) and has(publish_year)
| count() by author_name | sort -r count | head 3

We use the over iterator, a record constructor, the has function for missing fields/nulls, the count aggregate function, and the Unix-like sort and head operators.

If we put the above query in a file called query.zed and run this:

zq -j -I query.zed openlibrary.json

we get the following result:

{"author_name":"S. Stepniak","count":38}
{"author_name":"Władysław Stępniak","count":11}
{"author_name":"Andrzej Stępniak","count":4}

This time the answer looks a bit more reasonable as there aren’t redundant author_name values from the group-by.

But why are the two answers different?

After a little sleuthing, we discovered that the problem in the jq query arises from Unicode equivalence, i.e., the same visible character can be encoded in more than one way (for example, ę as a single precomposed code point or as e followed by a combining ogonek). As a result, jq places the different Unicode strings into two different group-by buckets.

This surprising behavior can be reproduced with this simple jq command:

$ jq -n '"Władysław Stępniak"=="Władysław Stępniak"'
false

The strings appear the same but their equality test is false!

Welcome to the world of messy data.

Cleaning up the Unicode strings

So what should we do here?

To avoid such problems, zq uses Unicode normalization so that the two variations of the same string are treated as equal, thus giving the “correct answer”. In fact,

zq 'yield "Władysław Stępniak"=="Władysław Stępniak"'

produces true, as it should.
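The effect is easy to reproduce outside of either tool. The Python sketch below spells the two encodings out with explicit escapes so that copy and paste cannot silently normalize them. Which character actually differed in the OpenLibrary data is an assumption here (ę is used because it has both a precomposed and a decomposed form), and NFC is chosen only as one common normalization form.

```python
import unicodedata

precomposed = "St\u0119pniak"   # ę as a single code point (U+0119)
decomposed = "Ste\u0328pniak"   # e followed by a combining ogonek (U+0328)

# The strings render identically but are different code point sequences.
equal_raw = precomposed == decomposed
print(equal_raw)  # False

# After normalizing both to NFC, they compare equal.
equal_nfc = (unicodedata.normalize("NFC", precomposed)
             == unicodedata.normalize("NFC", decomposed))
print(equal_nfc)  # True
```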

It would be nice if jq could normalize Unicode strings at least with a helper function, but after poking around a bit, I couldn’t find any related support.

However, zq supports this just fine. So, let’s just pass the JSON data through zq before processing it with jq, and voila, jq now gives the correct answer!

$ zq -j openlibrary.json | jq '[.docs[] | {title,author_name: .author_name[0], publish_year: .publish_year[0]} | select(.author_name!=null and .publish_year!=null)] | group_by(.author_name)| [.[] | {author_name: .[0].author_name, count: . | length}] | sort_by(.count) | reverse | limit(3;.[])'
{
  "author_name": "S. Stepniak",
  "count": 38
}
{
  "author_name": "Władysław Stępniak",
  "count": 11
}
{
  "author_name": "Andrzej Stępniak",
  "count": 4
}

But is it Really Faster?

The title of this article implies parenthetically that zq is faster than jq but we haven’t addressed performance in the examples above.

We will cover zq’s performance in a future article, but to cut to the chase here, zq is almost always at least a bit faster than jq when processing JSON inputs, and it is remarkably faster (5-100X) if you process data in Zed’s native format ZNG.

If you’d like to learn more, there are some performance comparisons in the zq documentation.

Wrapping Up

Okay, this is all a lot to digest, and we’ve just scratched the surface of zq and the Zed project. Given this short introduction, we hope you’ll agree that we’ve tackled some of the harder-to-use aspects of jq.

The point here isn’t that zq is a cool and fun tool. Of course it is, but more broadly, we have a very serious ambition: to make data 10X easier than it is today, changing the way people approach data problems and meaningfully improving the economics of data systems.

The dichotomy between schema-flexible JSON and schema-rigid relational tables pervades the modern data stack and frankly makes things hard and brittle. We’ll write about this in future articles but our view is that today’s solutions for fitting JSON into tables or running SQL queries over JSON are like putting a square peg in a round hole. There’s a ton of duct tape out there.

It’s hard to make things easy, and with zq and Zed, we’re trying to make exploring, transforming, and managing data even easier than jq. If you could run jq-like queries at cloud warehouse scale, wouldn’t that be powerful? No need to define tables and schemas every time you want to look at a new and different set of data.

These topics and more will all be laid out in forthcoming articles over the coming months. In the meantime, if you want to learn more about Zed:

We love working with all our users to help guide us to the best ways to solve your real, everyday problems. Give us a holler and we look forward to chatting.