Geoip geolocation with Google BigQuery

MAR 03, 2014
Author PhotoBy Felipe Hoffa, Cloud Platform team

Cross-posted from the Google Cloud Platform Blog

Aggregating numbers by geolocation is a powerful way to analyze your data, but not an easy task when you have millions of IP addresses to analyze. In this post, we'll check how we can we use Google BigQuery to quickly solve this use case using a publicly available dataset.

We take the developer community seriously and it’s a great way for us to see what your use cases are. This is where I found a very interesting question: "user2881671" on Stack Overflow had created a way to transform IP addresses into geographical locations in BigQuery, and asked for help optimizing their query. We worked out an optimized solution there, and today I'm happy to present an even better solution.

For example, if you want to peek at what are the top cities contributing modifications to Wikipedia, you can run this query:
SELECT COUNT(*) c, city, countryLabel, NTH(1, latitude) lat, NTH(1, longitude) lng
      FROM (
      SELECT
      INTEGER(PARSE_IP(contributor_ip)) AS clientIpNum,
      INTEGER(PARSE_IP(contributor_ip)/(256*256)) AS classB
      FROM
      [publicdata:samples.wikipedia]
      WHERE contributor_ip IS NOT NULL
      ) AS a
      JOIN EACH [fh-bigquery:geocode.geolite_city_bq_b2b] AS b
      ON a.classB = b.classB
      WHERE a.clientIpNum BETWEEN b.startIpNum AND b.endIpNum
      AND city != ''
      GROUP BY city, countryLabel
      ORDER BY 1 DESC
      
We can visualize the results on a map:


You can do the same operation with your own tables containing ipv4 IP addresses. Just take the previous query and replace [publicdata:samples.wikipedia] with your own table, and contributor_ip with the name of its column containing ipv4 addresses.

Technical details

First, I downloaded the Creative Commons licensed GeoLite City IPv4 made available by MaxMind in its .csv format. There is a newer database available too, but I didn't work with it as it's only available in binary form for now. I uploaded its 2 tables into BigQuery: blocks and locations.

To get better performance later, some processing was needed: For each rule I extracted into a new column its class B prefix (192.168.x.x) and generated duplicate rules for segments that spanned more than one B class. I also joined both original tables, to skip that step when processing data. In the StackOverflow question "user2881671" went even further, generating additional rules for segments without a location mapping (cleverly using the LAG() window function), but I skipped that step here (so addresses without a location will be skipped rather than counted). In total, only 32,702 new rows were needed.

The final query JOINs the class B prefix from your IP addresses with the lookup table, to prevent the performance hit of doing a full cross join.

You can find the new table with the BigQuery web UI, or using the REST-based API to integrate these queries and dataset with your own software.

To get started with BigQuery, you can check out our site and the "What is BigQuery" introduction. You can post questions and get quick answers about BigQuery usage and development on Stack Overflow. Follow the latest BigQuery news at www.reddit.com/r/bigquery. We love your feedback and comments. Join the discussion on +Google Cloud Platform using the hashtag #BigQuery.


This post includes GeoLite data created by MaxMind, available from http://www.maxmind.com, distributed under the Creative Commons Attribution-ShareAlike 3.0 Unported License.


Felipe Hoffa is part of the Cloud Platform Team. He'd love to see the world's data accessible for everyone in BigQuery.

Posted by Scott Knaster, Editor