Got big JSON? BigQuery expands data import for large scale web apps
By Ryan Boyd, Developer Advocate
JSON is the data format of the web. It powers most modern websites, is a native format for
many NoSQL databases hosting top web applications, and is the primary data format in many
REST APIs.
Google BigQuery, our cloud service for ad-hoc analytics on big data, has now added support
for JSON and the nested/repeated structure inherent in the data format.
JSON opens the door to a more object-oriented
view of your data compared to CSV, the original data format supported by BigQuery. It removes
the need to duplicate data when you flatten records into CSV. Here are some examples of data
for which you might find the JSON format useful:
- Log files, with multiple headers and other name-value pairs.
- User session activities, with information about each activity occurring nested
beneath the session record.
- Sensor data, with variable attributes collected in each measurement.
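To make the duplication point concrete, here is a minimal Python sketch of the user-session case. The record and its field names (session_id, user, activities) are hypothetical and only for illustration. It writes the same session once as a single line of JSON with the activities nested beneath it, and once as flattened CSV rows where the session fields repeat on every row.

```python
import csv
import io
import json

# Hypothetical session record; the field names are illustrative only.
session = {
    "session_id": "s-001",
    "user": "alice",
    "activities": [
        {"action": "view", "page": "/home"},
        {"action": "click", "page": "/pricing"},
    ],
}


def to_json_line(record):
    """Serialize one record as a single line of JSON, nesting preserved."""
    return json.dumps(record, sort_keys=True)


def to_csv_rows(record):
    """Flatten the record: one row per activity, session fields repeated."""
    return [
        [record["session_id"], record["user"], a["action"], a["page"]]
        for a in record["activities"]
    ]


json_line = to_json_line(session)
csv_rows = to_csv_rows(session)

# Render the flattened rows as CSV text for comparison.
buf = io.StringIO()
csv.writer(buf).writerows(csv_rows)

print(json_line)
print(buf.getvalue())
```

In the CSV output the session ID and user name appear once per activity. In the JSON line they appear once per session, however many activities it holds.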
Nested/repeated data support is one of our most requested features. And while
BigQuery's underlying infrastructure supports it, we'd only enabled it in a limited fashion
through M-Lab's test data. Today, however, developers can use JSON to get any nested/repeated
data into and out of BigQuery.
For more information on importing JSON and nested/repeated data into BigQuery, check out the
new guide in our documentation. You should also see the Dealing with Data section for details
on the new querying syntax available for this type of data.
Improvements to Data Loading Pipeline
We’ve made it much easier to ingest data into BigQuery – up to 1TB of data per load job, with
each file up to 100GB uncompressed JSON or CSV. We’ve also eliminated the 2 imports per minute
rate limit, enabling you to submit all your ingestion jobs and let us handle the queuing as
necessary. In a recent project I’ve been working on, import jobs for 3TB of data that
previously took me 12 hours to run now take me only 36 minutes, a 20x improvement!
We’ve published a new Ingestion Cookbook that explains how to take advantage of these new
limits.
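As a rough illustration of working within the limits quoted above (this is not taken from the Ingestion Cookbook), the Python sketch below groups files into load jobs so that no job exceeds 1TB and no file exceeds 100GB. The helper name plan_load_jobs is hypothetical, the decimal interpretation of TB and GB is an assumption, and the code only plans the grouping; it does not call any BigQuery API.

```python
from typing import List, Tuple

# Limits quoted in the post. Decimal units (1 TB = 10**12 bytes) are an assumption.
MAX_JOB_BYTES = 10**12
MAX_FILE_BYTES = 100 * 10**9


def plan_load_jobs(files: List[Tuple[str, int]]) -> List[List[str]]:
    """Greedily group (name, uncompressed_size_in_bytes) pairs into load jobs."""
    jobs: List[List[str]] = []
    current: List[str] = []
    current_bytes = 0
    for name, size in files:
        if size > MAX_FILE_BYTES:
            raise ValueError(f"{name} exceeds the per-file limit; split it first")
        # Start a new job when adding this file would exceed the per-job limit.
        if current and current_bytes + size > MAX_JOB_BYTES:
            jobs.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        jobs.append(current)
    return jobs


# Example: 3TB of data split into thirty 100GB files (hypothetical file names).
files = [(f"part-{i:02d}.json", 100 * 10**9) for i in range(30)]
jobs = plan_load_jobs(files)
print(len(jobs), [len(job) for job in jobs])
```

With the 2 imports per minute rate limit gone, every job planned this way can be submitted at once and left to the service to queue.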
We’re initiating a small trusted tester program aimed at making it easier to move your data
from the App Engine Datastore to BigQuery for analysis. If you store a lot of data in
Datastore and are also using BigQuery, we’d like to hear from you. Please sign up now to be
considered for the trusted tester program.
Learn more this week
Michael Manoochehri, Siddartha Naidu and I are in London this week talking about BigQuery and
these new features at the Strata big data conference. Ju-kay Kwek will also be talking about
BigQuery at the Interop NYC conference tomorrow. Please stop by, say hi, and let us know what
you’re doing with big data.
We’ll also be producing a Google Developers Live session from Campus London on Friday at
16:00 BST (15:00 GMT).
Ryan Boyd is a Developer Advocate, focused on big data. He's been at Google for 6 years
and previously helped build out the Google Apps ISV ecosystem. He published his first book,
"Getting Started with OAuth 2.0", with O'Reilly.
Posted by Scott Knaster, Editor