Using Google BigQuery to learn from GitHub data
By Ilya Grigorik, Web Performance Engineer
Open-source developers all over the world contribute to millions of projects every day:
writing and reviewing code, filing and discussing bug reports, updating documentation and
project wikis, and so forth. The data generated from this activity can reveal interesting
trends across many industries, including popularity of programming languages over time, defect
rates, contribution metrics, and popularity of specific frameworks and libraries.
The challenge in extracting these trends is gathering the data. Each project has its own
distributed workflow, code repositories, and conventions. Having hosted dozens of my own
projects on GitHub, I've long wanted to analyze the developer activity from the 2.6M+ public
projects hosted on GitHub. Hence, earlier this year GitHub Archive was born!
GitHub Archive is a project to record the public GitHub timeline, archive it, and make it
easily accessible for further analysis.
Each day it archives over 120,000 public activities, ranging from new commits and fork events
to opening and closing tickets, each with detailed metadata.
Once I collected the data, I needed a tool to analyze it, and that is when I found
Google BigQuery. Based on the research behind Dremel, a popular internal tool at Google for
analyzing web-scale datasets, BigQuery allowed me to easily import the entire dataset and use
a familiar SQL-like syntax to comb through the gigabytes of data in seconds. Plus, the tool
will scale to terabyte datasets, so there is plenty of room to grow!
The best news is that thanks to collaboration from the GitHub and BigQuery teams, the GitHub
dataset is now public and available for you to slice and dice in any way you like. No need to
worry about data gathering or database schemas: BigQuery will do all the heavy lifting, and
you can just compose your queries to be executed in real time.
Here's a real-world example. What are the most popular programming languages on GitHub over
the past month?
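To give a feel for the shape of that question, here is a minimal Python sketch of the same aggregation: filter the events down to pushes, group them by language, and sort by count. The sample records and the field names (`type`, `language`) are made up for illustration and are not the dataset's actual schema; in BigQuery the same idea would be expressed as a grouped, ordered count over the public dataset.

```python
from collections import Counter

# Hypothetical sample of timeline events. The field names are illustrative
# assumptions, not the real GitHub Archive schema.
events = [
    {"type": "PushEvent", "language": "JavaScript"},
    {"type": "PushEvent", "language": "Ruby"},
    {"type": "PushEvent", "language": "JavaScript"},
    {"type": "WatchEvent", "language": "JavaScript"},
    {"type": "PushEvent", "language": "Python"},
    {"type": "PushEvent", "language": "Ruby"},
    {"type": "PushEvent", "language": "JavaScript"},
    {"type": "PushEvent", "language": None},
]


def top_languages(events, limit=3):
    """Count push events per language, most frequent first.

    Roughly what a GROUP BY language / ORDER BY count DESC query does.
    """
    counts = Counter(
        event["language"]
        for event in events
        # Keep only pushes that have a detected language.
        if event["type"] == "PushEvent" and event.get("language")
    )
    return counts.most_common(limit)


if __name__ == "__main__":
    for language, pushes in top_languages(events):
        print(language, pushes)
```

Running this over a handful of in-memory records is trivial; the appeal of BigQuery is that the equivalent query runs over the whole archive without you having to download or index anything yourself.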
If you are curious for more, sign up for BigQuery and follow the instructions on
githubarchive.org to access the GitHub dataset. You can use the free 100GB query quota to run
your analysis and perhaps even win some of the prizes from the GitHub Data Challenge!
Ilya Grigorik is a Web Performance Engineer and Advocate at Google, an open-source
evangelist, and an analytics geek. You can find him on GitHub under igrigorik, and blogging about web performance
at igvita.com.
Posted by Scott Knaster, Editor