Updates to Google BigQuery following Cloud Platform Live

MAR 31, 2014
Author PhotoBy Felipe Hoffa, Cloud Platform team

Cross-posted from the Google Cloud Platform Blog
Editor's note: This post is a follow-up to the announcements we made on March 25th at Google Cloud Platform Live.

Last Tuesday we announced an exciting set of changes to Google BigQuery making your experience easier, faster and more powerful. In addition to new features and improvements like table wildcard functions, views, and parallel exports, BigQuery now features increased streaming capacity, lower pricing, and more.

1000x increase in streaming capacity

Last September we announced the ability to stream data into BigQuery for instant analysis, with an ingestion limit of 100 rows per second. While developers have enjoyed and exploited this capability, they've asked for more capacity. You now can stream up to 100,000 rows per second, per table into BigQuery - 1,000 times more than before.

For a great demonstration of the power of streaming data into BigQuery, check out the live demo from the keynote at Cloud Platform Live.

Users often partition their big tables into smaller units for data lifecycle and optimization purposes. For example, instead of having yearly tables, they could be split into monthly or even daily sets. BigQuery now offers table wildcard functions to help easily query tables that match common parameters.

The downside of partitioning tables is writing queries that need to access multiple tables. This would be easier if there was a way to tell BigQuery "process all the tables between March 3rd and March 25th" or "read every table which names start with an 'a'". You can do this with this release.

TABLE_DATE_RANGE() queries all tables that overlap with a time range (based on the table names), while TABLE_QUERY() accepts regular expressions to select the tables to analyze.

For more information, see the documentation and syntax for table wildcard functions.

Improved SQL support and table views

BigQuery has adopted SQL as its query language because it's one of the most well known, simple and powerful ways to analyze data. Nevertheless BigQuery used to impose some restrictions on traditional SQL-92, like having to write multiple sub-queries instead of simpler multi-joins. Not anymore, now BigQuery supports multi-join and CROSS JOIN, and improves its SQL capabilities with more flexible alias support, fewer ORDER BY restrictions, more window functions, smarter PARTITION BY, and more.

A notable new feature is the ability to save queries as views, and use them as building blocks for more complex queries. To define a view, you can use the browser tool to save a query, the API, or the newest version of the BigQuery command-line tool (by downloading the Google Cloud SDK).

User-defined metadata

Now you can annotate each dataset, table, and field with descriptions that are displayed within BigQuery. This way people you share your datasets with will have an easier time identifying them.

JSON parsing functions

BigQuery is optimized for structured data: before loading data into BigQuery, you should first define a table with the right columns. This is not always easy, as JSON schemas might be flexible and in constant flux. BigQuery now lets you store JSON encoded objects into string fields, and you can use the JSON_EXTRACT and JSON_EXTRACT_SCALAR functions to easily parse them later using JSONPath-like expressions.

For example:
SELECT json_extract_scalar(
      "{'book': {
      'title':'Harry Potter'}}",

Fast parallel exports

BigQuery is a great place to store all your data and have it ready for instant analysis using SQL queries. But sometimes SQL is not enough, and you might want to analyze your data with external tools. That's why we developed the new fast parallel exports: With this feature, you can define how many workers will be consuming the data, and BigQuery exports the data to multiple files optimized for the available number of workers.

Check the exporting data documentation, or stay tuned for the upcoming Hadoop connector to BigQuery documentation.

Massive price reductions

At Cloud Platform live, we announced a massive price reduction: Storage costs are going down 68%, from 8 cents per gigabyte per month to only 2.6, while querying costs are going down 85%, from 3.5 cents per gigabyte to only 0.5. Previously announced streaming costs are now reduced by 90%. And finally, we announced the ability to purchase reserved processing capacity, for even cheaper prices and the ability to precisely predict costs. And you always have the option to burst using on-demand capacity.

I want to take this space to celebrate the latest open source community contributions to the BigQuery ecosystem. R has its own connector to BigQuery (and a tutorial), as Python pandas too (check out the video we made with Pearson). Ruby developers are now able to use BigQuery with an ActiveRecord connector, and send all their logs with fluentd. Thanks all, and keep surprising us!

Felipe Hoffa is part of the Cloud Platform Team. He'd love to see the world's data accessible for everyone in BigQuery.

Posted by Louis Gray, Googler