Google BigQuery new features: bigger, faster, smarter
By Felipe
Hoffa, Cloud Platform team
Google BigQuery is
designed to make it easy to analyze large amounts of data quickly.
Today
we announced several updates that give BigQuery the ability to handle arbitrarily
large result sets, use window functions for advanced analytics, and cache query results. You
are also getting new UI features, larger interactive quotas, and a new convenient tiered
pricing scheme. In this post we'll dig further into the technical details of these new
features.
Large results
BigQuery is able to process terabytes of data, but until today BigQuery could only output up
to 128 MB of compressed data per query. Many of you asked for more and from now on BigQuery
will be able to output results as large as the largest tables our customers have ever
had.
To get this benefit, you should enable the new "
--allow_large_results
"
flag when issuing a query job, and specify a destination table. All results will be saved to
the new specified table (or appended, if the table exists). In the updated web UI these
options can be found under the new "Enable Options" menu.
With this feature, you can run big transformations on your tables, plus get big subsets of
data to further analyze from the new table.
Analytic functions
BigQuery's power is in the ability to interactively run aggregate queries over terabytes of
data, but sometimes counts and averages are not enough. That's why BigQuery also lets you
calculate quantiles, variance and standard deviation, as well as other
advanced
functions.
To make BigQuery even more powerful, today we are adding support for
window
functions (also known as "analytical functions") for ranking, percentiles, and
relative row navigation. These new functions give you different ways to rank results, explore
distributions and percentiles, and traverse results without the need for a self join.
To introduce these functions with an advanced example, let's use the dataset we collected from
the
Data Sensing Lab at Google
I/O. With the
percentile_cont()
function it's easy to get
the median temperature over each room:
SELECT percentile_cont(0.5) OVER (PARTITION BY room ORDER BY data) AS median, room
FROM [io_sensor_data.moscone_io13]
WHERE sensortype='temperature'
In this example, each original data row shows the median temperature for each room. To
visualize it better, it's a good idea to group all results by room with an outer query:
SELECT MAX(median) AS median, room FROM (
SELECT percentile_cont(0.5) OVER (PARTITION BY room ORDER BY data) AS median, room
FROM [io_sensor_data.moscone_io13]
WHERE sensortype='temperature'
)
GROUP BY room
We can add an additional outer query, to rank the rooms according to which one had the coldest
median temperature. We'll use one of the new ranking window functions,
dense_rank()
:
SELECT DENSE_RANK() OVER (ORDER BY median) rank, median, room FROM (
SELECT MAX(median) AS median, room FROM (
SELECT percentile_cont(0.5) OVER (PARTITION BY room ORDER BY data) AS median, room
FROM [io_sensor_data.moscone_io13]
WHERE sensortype='temperature'
)
GROUP BY room
)
We've updated the
documentation
with descriptions and examples for each of the new window functions. Note that they require
the
OVER()
clause, with an optional
PARTITION
BY
and sometimes required
ORDER BY
arguments.
ORDER BY
tells the window function what criteria to use to rank
items, while
PARTITION BY
allows you to define multiple groups to be
analyzed independently of each other.
The window functions don't work with the big
GROUP EACH BY
and
JOIN EACH BY
operators, but they do work with the traditional
GROUP BY
and
JOIN BY
. As a reminder,
we
announced GROUP EACH BY
and
JOIN EACH
BY
last March, to allow large join and group operations.
Query caching
BigQuery now remembers values that you've previously computed, saving you time and the cost of
recalculating the query.
To maintain privacy, queries are cached on a per-user basis. Cached results are only returned
for tables that haven't changed since the last query, or for queries that are not dependent on
non-deterministic parameters (such as the current time). Reading cached results is free, but
each query still counts against the max number of queries per day quota.
Query results are kept cached for 24 hours, on a best effort basis.
You can disable query caching with the new flag
--use_cache
in bq, or "
useQueryCache
"
in the API. This feature is also accessible with the new query options on the BigQuery Web
UI.
BigQuery Web UI: Query validator, cost estimator, and abandonment
The
BigQuery UI gets even better:
You'll get instant information while writing a query if its syntax is valid. If the syntax is
not valid, you'll know where the error is. If the syntax is valid, the UI will inform you how
much the query would cost to run. This feature is also available with the bq tool and API,
using the
--dry_run
flag.
An additional improvement: When running queries on the UI, previously you had to wait until
its completion before starting another one. Now you have the option to abandon it, to start
working on the next iteration of the query without waiting for the abandoned one.
Pricing updates
Starting in July, BigQuery
pricing becomes more
affordable for everyone: Data storage costs are going from $0.12/GB/month to $0.08/GB/month.
And if you are a high-volume user, you'll soon be able to opt-in for tiered query pricing, for
even better value.
Bigger quota
To support larger workloads we're
doubling interactive query
quotas for all users, from 200GB + 1 concurrent query, to 400 GB of concurrent
queries + 2 additional queries of unlimited size.
These updates make BigQuery a faster, smarter, and even more affordable solution for ad hoc
analysis of extremely large datasets. We expect they'll help to scale your projects, and we
hope you'll share your use cases with us on
Google+.
The BigQuery UI features a collection of public datasets for you to use when trying
out these new features. To get started, visit our sign-up page and Quick Start guide. You
should take a look at our API docs, and ask questions about BigQuery development on Stack
Overflow. Finally, don't forget to give us feedback and join the discussion on our Cloud
Platform Developers Google+ page.
Felipe Hoffa has recently joined the Cloud Platform team. He'd love to see the
world's data accessible for everyone in BigQuery.
Posted by Ashleigh Rentz, Editor
Emerita