BigQuery gets big new features to make data analysis even easier
By Michael Manoochehri, Developer Programs Engineer, Cloud Platform
Google BigQuery is designed to make it easy to analyze large amounts of data quickly.
Overwhelmingly, developers have asked us for features to help simplify their work even
further. Today we are launching a collection of updates that gives BigQuery a greater range of
query and data types, more flexibility with table structure, and better tools for
collaborative analysis.
Big JOIN and Big Group Aggregations
Extracting insights from multiple datasets can be challenging and time-consuming. This is
especially true when datasets become too large to query with a traditional database system.
With traditional databases, SQL functions like joining and grouping are typically used to
bring together data for analysis. What happens when your data is too large to fit into a
conventional database? Working with multi-terabyte datasets often requires developing
complicated MapReduce workflows, investing in expensive infrastructure, and great deal of
time. Very often, it's a combination of all three.
In response to developer feedback, we're launching new features that enable analysts and
developers to run fast SQL-like join and aggregate queries on datasets without the need for
batch-based processing. Our new
Big JOIN
feature gives users the ability to produce a result set by merging data from two large tables
by a common key. Big JOIN simplifies data analysis that would otherwise require a data
transformation step, by allowing users to specify JOIN operations using SQL.
Popular web applications produce user activity logs that can grow by billions of rows each
week. Dividing users into smaller groups is a key step for analysis. However, each group of
users can number in the millions. To handle this for such large volumes, we've enabled Big
Group Aggregations, which significantly increases the number of distinct values that can be
grouped in a result set.
To use these new features, simply add the EACH modifier to JOIN or GROUP BY clauses.
/* JOIN EACH example
* Selects the top 10
most edited Wikipedia pages
* of words that appear
in works of Shakespeare.
*/
SELECT
TOP(wiki.title, 10), COUNT(*)
FROM
[publicdata:samples.wikipedia] AS wiki
JOIN
EACH
[publicdata:samples.shakespeare] AS shakespeare
ON
shakespeare.word = wiki.title; |
For more information, including best practices, when using JOIN EACH and GROUP EACH BY, visit
the BigQuery
query
reference.
Native support for TIMESTAMP data type
We are also adding a new TIMESTAMP data type, in response to one of our most frequent requests
from developers. This new data type lets you import date and time values in formats familiar
to users of databases such as MySQL, while preserving timezone offset information.
Along with the new data type come new functions for converting TIMESTAMP fields into other
formats, calculating intervals, and extracting components such as the hour, day of week, and
quarter.
/* TIMESTAMP example
* Which hours in the
day are the most popular for GitHub actions?
* This query converts
github_timeline "created_at" date time
* strings to BigQuery
TIMESTAMP, and extracts the hour from each.
*/
SELECT
HOUR(TIMESTAMP(created_at)) AS event_create_hour,
COUNT(*) AS
event_count
FROM
[publicdata:samples.github_timeline]
GROUP
BY
event_create_hour
ORDER
BY
event_count
DESC; |
Read more about the available TIMESTAMP functions in our
query
reference guide.
Add columns to existing BigQuery tables
When working with large amounts of fast moving data, it's not uncommon to find out that you
need to add additional fields to your tables. In response to developer feedback, we have added
the ability to add new columns to existing BigQuery tables.
To take advantage of this feature, simply provide a new schema with additional columns using
either the "Tables: update" or "Tables: patch" BigQuery API methods.
For more information on this feature, visit the
BigQuery API
reference.
BigQuery Web UI: Dataset links and dataset sharing notifications
BigQuery has always provided project owners with very fine-grained control of how their
datasets are shared. To make it easier for teams to work on collaborative data analysis, we've
added direct links to individual datasets in the BigQuery Web UI. This provides a convenient
way for authorized users to quickly access a dataset, and allows for bookmarking and
sharing.
In addition, we've also added email notifications to inform users when they've been given
dataset access privileges. When a dataset has been shared with another user via the sharing
control panel, BigQuery sends a notification email containing a direct link to the
dataset.
The BigQuery UI features a collection of public datasets for you to use when trying out these
new features. To get started, visit our
sign up page and
Quick Start
guide. You should take a look at
our API docs, and ask questions
about BigQuery development on
Stack Overflow.
Finally, don't forget to give us feedback and join the discussion on our
Cloud Platform Developers Google+
page.
Michael
Manoochehri is a Developer Programs Engineer supporting the Google Cloud Platform.
His goal is to help make cloud computing and data analysis universally accessible and
useful.
Posted by Scott Knaster,
Editor