Geoip geolocation with Google BigQuery
By Felipe Hoffa, Cloud Platform team
Cross-posted from the Google
Cloud Platform Blog
Aggregating numbers by geolocation is a powerful way to analyze your data, but not an easy
task when you have millions of IP addresses to analyze. In this post, we'll check how we can
we use
Google BigQuery
to quickly solve this use case using a publicly available dataset.
We take the developer community seriously and it’s a great way for us to see what your use
cases are. This is where I found a very interesting question: "user2881671" on
Stack
Overflow had created a way to
transform
IP addresses into geographical locations in BigQuery, and asked for help optimizing
their query. We worked out an optimized solution there, and today I'm happy to present an even
better solution.
For example, if you want to peek at what are the top cities contributing modifications to
Wikipedia, you can run this query:
SELECT COUNT(*) c, city, countryLabel, NTH(1, latitude) lat, NTH(1, longitude) lng
FROM (
SELECT
INTEGER(PARSE_IP(contributor_ip)) AS clientIpNum,
INTEGER(PARSE_IP(contributor_ip)/(256*256)) AS classB
FROM
[publicdata:samples.wikipedia]
WHERE contributor_ip IS NOT NULL
) AS a
JOIN EACH [fh-bigquery:geocode.geolite_city_bq_b2b] AS b
ON a.classB = b.classB
WHERE a.clientIpNum BETWEEN b.startIpNum AND b.endIpNum
AND city != ''
GROUP BY city, countryLabel
ORDER BY 1 DESC
We can visualize the results on a map:
You can do the same operation with your own tables containing ipv4 IP addresses. Just take the
previous query and replace [
publicdata:samples.wikipedia]
with your own table, and contributor_ip with the name of its column containing ipv4
addresses.
Technical details
First, I downloaded the Creative Commons licensed
GeoLite City IPv4 made available by
MaxMind in its .csv format. There is a
newer database available
too, but I didn't work with it as it's only available in binary form for now. I uploaded its 2
tables into BigQuery:
blocks and
locations.
To get better performance later, some processing was needed: For each rule I extracted into a
new column its class B prefix (192.168.x.x) and generated duplicate rules for segments that
spanned more than one B class. I also joined both original tables, to skip that step when
processing data. In the StackOverflow question "user2881671" went even further, generating
additional rules for segments without a location mapping (cleverly using the LAG() window
function), but I skipped that step here (so addresses without a location will be skipped
rather than counted). In total, only 32,702 new rows were needed.
The final query JOINs the class B prefix from your IP addresses with the lookup table, to
prevent the performance hit of doing a full cross join.
You can find
the
new table with the BigQuery web UI, or using the REST-based API to integrate these
queries and dataset with your own software.
To get started with BigQuery, you can
check out our
site and the
"What is BigQuery"
introduction. You can post questions and get quick answers about BigQuery usage and
development on
Stack
Overflow. Follow the latest BigQuery news at
www.reddit.com/r/bigquery. We love your
feedback and comments. Join the discussion on
+Google Cloud Platform
using the hashtag
#BigQuery.
This post includes GeoLite data created by MaxMind, available from http://www.maxmind.com, distributed under the
Creative Commons
Attribution-ShareAlike 3.0 Unported License.
Felipe Hoffa is part
of the Cloud Platform Team. He'd love to see the world's data accessible for everyone in
BigQuery.
Posted by Scott Knaster,
Editor