Whether for website analytics or determining the origin of a network attack, Internet geolocation is a handy technology. In brief, given an IP address associated with some network traffic, can we deduce where on earth the traffic originated? There’s plenty of software and services out there to provide this information, but a familiar source is the GeoLite2 data from MaxMind. Not only is it free, but it’s also provided in the form of downloadable data sets, which lets us pick it apart in interesting ways.
MaxMind makes their GeoLite2 data available in a couple of formats, one being the common (and sometimes loathed) CSV format. They provide helpful docs that describe the schemas and a handy cookbook for defining SQL tables, importing the CSV data into those tables, and then performing a join between two of the data sets. The guide is super helpful, and if you follow it to the letter, you’ll get to experience the joy of entering an IP address and getting back a city name. But before you can get there, you first have to enumerate every field name and desired data type so the CSV can be loaded into a rigid, predefined structure. Maybe there’s another way?
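To get a feel for that prerequisite, here’s a rough sketch of the table-definition-and-import step in PostgreSQL. To be clear, this is my own abbreviated approximation rather than MaxMind’s exact DDL: the database name, table name, and column types are guesses based on the data we’ll look at below, and the locations CSV would need the same treatment.
$ psql geoip <<'SQL'
-- Every column must be spelled out by hand before a single row can load.
CREATE TABLE network_blocks (
  network                        cidr,
  geoname_id                     integer,
  registered_country_geoname_id  integer,
  represented_country_geoname_id integer,
  is_anonymous_proxy             smallint,
  is_satellite_provider          smallint,
  postal_code                    text,
  latitude                       numeric,
  longitude                      numeric,
  accuracy_radius                integer
);
\copy network_blocks FROM 'GeoLite2-City-Blocks-IPv4.csv' WITH CSV HEADER
SQL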
To validate the Zed project’s usefulness in real-world use cases, I often apply Zed to data sets and guides like these. In Zed’s docs, we sometimes speak of how it offers a gradual slope. Whereas MaxMind’s guide effectively says “Your CSV data must become this SQL to ride this ride!”, Zed lets you apply only the minimal set of concepts from the language, data model, and tooling needed to solve your problem.
Replicating these operations on the GeoLite2 data with Zed shows a nice example of this gradual slope in action. Read on to climb the slope from a quick one-shot query to a performant join that avoids the hassle of SQL tables.
Examining the Data
Let’s say we’ve downloaded and unpacked the most recent GeoLite2 City: CSV Format ZIP. Much like MaxMind’s guide, we’ll focus on these three CSV files.
$ ls -l *Blocks* *en*
-rw-rw-r--@ 1 phil staff 242435417 Oct 16 08:36 GeoLite2-City-Blocks-IPv4.csv
-rw-rw-r--@ 1 phil staff 81213494 Oct 16 08:36 GeoLite2-City-Blocks-IPv6.csv
-rw-rw-r--@ 1 phil staff 10036753 Oct 16 08:36 GeoLite2-City-Locations-en.csv
To peek at some sample data, we could refer to MaxMind’s description of the schemas or load the CSV files into tools like Excel or Google Sheets. However, Zed provides a handy point of entry thanks to its ability to auto-detect most common input formats. With a quick one-shot invocation of the zq command-line tool, we can use Zed’s head operator to see samples of the data.
$ zq -Z 'head' GeoLite2-City-Blocks-IPv4.csv
{
network: "1.0.0.0/24",
geoname_id: 2077456.,
registered_country_geoname_id: 2077456.,
represented_country_geoname_id: null,
is_anonymous_proxy: 0.,
is_satellite_provider: 0.,
postal_code: null,
latitude: -33.494,
longitude: 143.2104,
accuracy_radius: 1000.
}
$ zq -Z 'head' GeoLite2-City-Locations-en.csv
{
geoname_id: 1392.,
locale_code: "en",
continent_code: "AS",
continent_name: "Asia",
country_iso_code: "IR",
country_name: "Iran",
subdivision_1_iso_code: 2.,
subdivision_1_name: "Māzandarān",
subdivision_2_iso_code: null,
subdivision_2_name: null,
city_name: "Shahr",
metro_code: null,
time_zone: "Asia/Tehran",
is_in_european_union: 0.
}
Compared to the SQL cookbook, notice that we didn’t need to enumerate any column names or specify data types in order to read the CSV files. Much like we see with spreadsheet tools, Zed has smart default behaviors when reading CSV inputs, recognizing which fields can be treated as numbers or strings and where null values appear. For many one-shot queries, this treatment may be all that’s needed. For example, let’s say we wanted to see the cities in Iran.
$ zq -f text 'country_name=="Iran" | cut city_name' GeoLite2-City-Locations-en.csv
Shahr
Jahan
Kaman
...
Richer Data Types Enable Richer Queries
Now that we’re familiar with our data, let’s revisit our original goal of mapping an IP address to its geographical location.
We see the “blocks” data is organized by CIDR block. Since it came from CSV, the network field starts life as a string. The SQL cookbook showed how to turn this string into a column with PostgreSQL’s cidr type and then use their >>= operator to check if an IP is inside such a block.
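For reference, that PostgreSQL lookup ends up looking something like the sketch below, assuming the hypothetical network_blocks table and geoip database from earlier (>>= is Postgres’s “contains or equals” operator for inet/cidr values).
$ psql geoip -c "
  SELECT network, geoname_id, postal_code, latitude, longitude
  FROM network_blocks
  WHERE network >>= inet '146.243.121.22';"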
Keeping with the theme of a gradual slope, we can apply the equivalent Zed concepts only where needed and rely on the default treatment of all other fields. Here’s what happens when we cast the network field to Zed’s net type.
$ zq -Z 'network:=net(network) | head' GeoLite2-City-Blocks-IPv4.csv
{
network: 1.0.0.0/24,
geoname_id: 2077456.,
registered_country_geoname_id: 2077456.,
represented_country_geoname_id: null,
is_anonymous_proxy: 0.,
is_satellite_provider: 0.,
postal_code: null,
latitude: -33.494,
longitude: 143.2104,
accuracy_radius: 1000.
}
Notice the lack of quotes around the value 1.0.0.0/24, indicating it’s no longer treated as a string. Now a simple search using Zed’s cidr_match function can show us the block that covers a particular IP address.
$ zq -Z 'network:=net(network) | cidr_match(network, 146.243.121.22)' GeoLite2-City-Blocks-IPv4.csv
{
network: 146.243.120.0/23,
geoname_id: 4952487.,
registered_country_geoname_id: 6252001.,
represented_country_geoname_id: null,
is_anonymous_proxy: 0.,
is_satellite_provider: 0.,
postal_code: 2777.,
latitude: 41.7505,
longitude: -71.2089,
accuracy_radius: 20.
}
Type casting is but one small part of the wider topic of Zed’s shaping and type fusion features. These flexible and powerful data transformation techniques can provide the “enumerate-every-field-and-data-type” treatment often seen with SQL… but only when you need to! Since we’re making good progress having surgically altered only one field, let’s keep going and see how far we get.
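Before moving on, here’s a small taste of where that can lead: the cast-by-assignment pattern we used on network extends to as many fields as you like. The example below is purely illustrative; the field choices are arbitrary and nothing later in this walkthrough depends on them.
$ zq -Z 'network:=net(network), accuracy_radius:=uint16(accuracy_radius) | head' GeoLite2-City-Blocks-IPv4.csv
When you truly want the enumerate-everything treatment, Zed’s shaping functions cover that ground. For now, though, our single cast is all we need.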
Joining to City Data
We’ve mapped our IP address into a subnet, but that’s only gotten us details like postal code and latitude/longitude. This might be adequate for plotting points on a map, but what if we want to see a city name?
MaxMind’s cookbook describes the significance of the geoname_id field appearing in both data sets, then shows how to leverage it to perform a SQL join to get the city detail.
Zed has its own join operator and we can apply it here, once again without having to turn our CSV into SQL tables. We’ll use Zed’s file operator to feed our multiple data sources into the respective left and right sides of the join. Finally, since our program is getting longer, we’ll spread it out over multiple lines.
$ zq -Z '
file GeoLite2-City-Blocks-IPv4.csv
| network:=net(network)
| cidr_match(network, 146.243.121.22)
| join (
file GeoLite2-City-Locations-en.csv
) on geoname_id=geoname_id location:=this
'
{
network: 146.243.120.0/23,
geoname_id: 4952487.,
registered_country_geoname_id: 6252001.,
represented_country_geoname_id: null,
is_anonymous_proxy: 0.,
is_satellite_provider: 0.,
postal_code: 2777.,
latitude: 41.7505,
longitude: -71.2089,
accuracy_radius: 20.,
location: {
geoname_id: 4952487.,
locale_code: "en",
continent_code: "NA",
continent_name: "North America",
country_iso_code: "US",
country_name: "United States",
subdivision_1_iso_code: "MA",
subdivision_1_name: "Massachusetts",
subdivision_2_iso_code: null,
subdivision_2_name: null,
city_name: "Swansea",
metro_code: 521.,
time_zone: "America/New_York",
is_in_european_union: 0.
}
}
SQL users know join is a big topic all its own, and indeed, there’s a Zed join tutorial that walks through examples of the many available flavors. However, if the example in MaxMind’s cookbook made sense to you, hopefully you’ll find the Zed one above equally intuitive. In brief, the logic is:
- Filter the “blocks” input down to just the record containing our target CIDR block
- Locate the “city” record that matches our “block” by finding shared geoname_id values
- Copy the matching “city” data into our “block” record as a nested field called location
Wanna Jump in a Lake?
If all we needed was this one-shot result showing our IP address is in Swansea, we’re done. We’ve saved ourselves plenty of time by not having to install database software, define tables, and load CSV files into them, all as a prerequisite to running our join.
But what if we expect to run this kind of query repeatedly? On my aging MacBook, the query above took about 14 seconds with zq. Considering we read a few hundred megabytes of CSV on the fly, transformed a field, performed a CIDR match, then did a join between two data sources, maybe that’s a reasonable cost. But it’s going to add up over multiple queries.
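(For the curious, a figure like that can be reproduced by simply wrapping the earlier zq pipeline in the shell’s time builtin, exactly as we’ll do with the lake query in a moment; the output is identical to what we saw above, so it’s omitted here.)
$ time zq -Z '
  file GeoLite2-City-Blocks-IPv4.csv
  | network:=net(network)
  | cidr_match(network, 146.243.121.22)
  | join (
    file GeoLite2-City-Locations-en.csv
  ) on geoname_id=geoname_id location:=this
'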
This brings us to our next step up the gradual slope: the Zed lake. The Zed lake is another topic that can be explored to whatever depth you need, but for our immediate purposes we can think of it as a place to store and query our data more efficiently. With a Zed lake service running, the following commands load our minimally-scrubbed CSV inputs into respective pools of the lake.
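(If you don’t already have a Zed lake service running, the commands below show one way to spin one up locally. Treat the lake path and flags as assumptions on my part and check zed serve -h on your install for specifics.)
$ zed init scratch            # create an empty lake in ./scratch (assumed syntax)
$ zed serve -lake scratch &   # start the lake service that the following commands talk to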
$ zed create -orderby network blocks
$ zed create -orderby geoname_id locations
$ zq 'network:=net(network)' GeoLite2-City-Blocks-IPv*.csv | zed load -use blocks -
$ zed load -use locations GeoLite2-City-Locations-en.csv
We’ve created pools called “blocks” and “locations” to hold the data from our respective CSV inputs. The -orderby option allows us to specify a pool key field by which data will be sorted when stored. This has the potential to significantly speed up queries involving that field, much like a primary key does for SQL. Since we still want the benefits of the minimal casting of the network field in our “blocks” data, we repurpose our zq invocation from earlier to perform that transform and feed its output into zed load to populate that pool. And since we don’t need any transforms on the city data, a straight call to zed load is adequate to populate that pool. Once loaded, all lake data is stored in Zed’s compact binary ZNG format.
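Before moving on, a quick zed ls makes a handy sanity check; it should list both of the pools we just created (output omitted).
$ zed ls   # expect to see the "blocks" and "locations" pools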
By going through these one-time steps, we can now avoid repeating the expensive operations of CSV parsing or type casting in the future and instead jump right to executing efficient queries. Since we’re now hitting pools in the lake, the Zed query syntax changes minimally compared to what we saw with zq and file inputs. Let’s wrap our query with time and see how much faster it runs.
$ time zed query -Z '
from blocks
| cidr_match(network, 146.243.121.22)
| join (
from locations
) on geoname_id=geoname_id location:=this
'
{
network: 146.243.120.0/23,
geoname_id: 4952487.,
registered_country_geoname_id: 6252001.,
represented_country_geoname_id: null,
is_anonymous_proxy: 0.,
is_satellite_provider: 0.,
postal_code: 2777.,
latitude: 41.7505,
longitude: -71.2089,
accuracy_radius: 20.,
location: {
geoname_id: 4952487.,
locale_code: "en",
continent_code: "NA",
continent_name: "North America",
country_iso_code: "US",
country_name: "United States",
subdivision_1_iso_code: "MA",
subdivision_1_name: "Massachusetts",
subdivision_2_iso_code: null,
subdivision_2_name: null,
city_name: "Swansea",
metro_code: 521.,
time_zone: "America/New_York",
is_in_european_union: 0.
}
}
real 0m0.193s
user 0m0.020s
sys 0m0.013s
Below 0.2 seconds? Not bad!
Parameters From the Shell
We’ve got good momentum going, so let’s take one more step up the gradual slope and see how to generalize this approach with Zed’s user-defined operators. These let you parameterize the variable portions of your logic to hide complexity and enable reuse.
Since the IP address is what’s changing here, we turn it into an argument addr. We’ll also store our program in a text file loc.zed.
op find_location(addr): (
from blocks
| cidr_match(network, addr)
| join (
from locations
) on geoname_id=geoname_id location:=this
)
Zed’s CLI tools provide a -I option for including Zed source from files like this one. We can therefore run the query from the shell like so.
$ zed query -Z -I loc.zed 'find_location(146.243.121.22)'
{
network: 146.243.120.0/23,
geoname_id: 4952487.,
registered_country_geoname_id: 6252001.,
represented_country_geoname_id: null,
is_anonymous_proxy: 0.,
is_satellite_provider: 0.,
postal_code: 2777.,
latitude: 41.7505,
longitude: -71.2089,
accuracy_radius: 20.,
location: {
geoname_id: 4952487.,
locale_code: "en",
continent_code: "NA",
continent_name: "North America",
country_iso_code: "US",
country_name: "United States",
subdivision_1_iso_code: "MA",
subdivision_1_name: "Massachusetts",
subdivision_2_iso_code: null,
subdivision_2_name: null,
city_name: "Swansea",
metro_code: 521.,
time_zone: "America/New_York",
is_in_european_union: 0.
}
}
And to achieve total shell nirvana, let’s dynamically query for our own IP and use Zed’s yield operator to see what city comes back.
$ zed query -Z -I loc.zed "find_location($(curl -s ifconfig.co)) | yield location.city_name"
"San Francisco"
As validated by the smell of sourdough bread emanating from my oven, this result is indeed correct!
Wrapping Up
Hopefully we’ve given you a better understanding of Zed’s gradual slope. Whether you need a one-shot query result to get on with your day or you’re looking to build a reusable production data pipeline, Zed lets you jump on board with the appropriate level of sophistication and performance.
Of course, this only scratches the surface of what’s possible. In addition to the links above, here are some suggestions of other areas to explore.
- If you like what you see, support Zed by giving it a star on GitHub
- Download the Zed CLI tooling and/or the Zui desktop app
- Check out the Zed repo on GitHub
- Read the Zed research paper