Posted by, Thomas Park, Senior Software Engineer, Google BigQuery
Many types of computations can be difficult or impossible to express in SQL. Loops, complex conditionals, and non-trivial string parsing or transformations are all common examples. What can you do when you need to perform these operations but your data lives in a SQL-based Big data tool? Is it possible to retain the convenience and speed of keeping your data in a single system, when portions of your logic are a poor fit for SQL?
Google BigQuery is a fully managed, petabyte-scale data analytics service that uses SQL as its query interface. As part of our latest BigQuery release, we are announcing support for executing user-defined functions (UDFs) over your BigQuery data. This gives you the ability to combine the convenience and accessibility of SQL with the option to use a familiar programming language, JavaScript, when SQL isn’t the right tool for the job.
BigQuery UDFs are similar to map functions in MapReduce. They take one row of input and produce zero or more rows of output, potentially with a different schema.
Below is a simple example that performs URL decoding. Although BigQuery provides a number of built-in functions, it does not have a built-in for decoding URL-encoded strings. However, this functionality is available in JavaScript, so we can extend BigQuery with a simple User-Defined Function to decode this type of data:
function decodeHelper(s) { try { return decodeURI(s); } catch (ex) { return s; } } // The UDF. function urlDecode(r, emit) { emit({title: decodeHelper(r.title), requests: r.num_requests}); }
BigQuery UDFs are functions with two formal parameters. The first parameter is a variable to which each input row will be bound. The second parameter is an “emitter” function. Each time the emitter is invoked with a JavaScript object, that object will be returned as a row to the query.
In the above example, urlDecode
is the UDF that will be
invoked from BigQuery. It calls a helper function decodeHelper
that
uses JavaScript’s built-in decodeURI
function to transform
URL-encoded data into UTF-8.
Note the use of try / catch
in
decodeHelper
. Data is sometimes dirty! If we encounter an error
decoding a particular string for any reason, the helper returns the original, un-decoded
string.
To make this function visible to BigQuery, it is necessary to include a registration call in your code that describes the function, including its input columns and output schema, and a name that you’ll use to reference the function in your SQL:
bigquery.defineFunction( 'urlDecode', // Name used to call the function from SQL. ['title', 'num_requests'], // Input column names. // JSON representation of output schema. [{name: 'title', type: 'string'}, {name: 'requests', type: 'integer'}], urlDecode // The UDF reference. );
The UDF can then be invoked by the name “urlDecode” in the SQL query, with a source
table or subquery as an argument. The following query looks for the most-visited French
Wikipedia articles from April 2015 that contain a cédille character
(ç
) in the title:
SELECT requests, title FROM urlDecode( SELECT title, sum(requests) AS num_requests FROM [fh-bigquery:wikipedia.pagecounts_201504] WHERE language = 'fr' GROUP EACH BY title ) WHERE title LIKE '%ç%' ORDER BY requests DESC LIMIT 100
This query processes data from a 5.6 billion row / 380 Gb dataset and generally runs in less than two minutes. The cost? About $1.37, at the time of this writing.
To use a UDF in a query, it must be described via UserDefinedFunctionResource
elements in your JobConfiguration
request.
UserDefinedFunctionResource
elements can either contain inline
JavaScript code or pointers
to code files stored in Google
Cloud Storage.
JavaScript UDFs are executed on instances of Google V8 running on Google servers. Your code runs close to your data in order to minimize added latency.
You don’t have to worry about provisioning hardware or managing pipelines to deal with data import / export. BigQuery automatically scales with the size of the data being queried in order to provide good query performance.
In addition, you only pay for what you use - there is no need to forecast usage or pre-purchase resources.
Interested in developing your JavaScript UDF without running up your BigQuery bill? Here is a simple browser-based widget that allows you to test and debug UDFs.
Note that not all JavaScript functionality supported in the browser is available in
BigQuery. For example, anything related to the browser DOM is unsupported, including
Window
and Document
objects, and any
functions that require them, such as atob() / btoa()
.
Pre-filter input
In our URL-decoding example, we are passing a subquery as the input to
urlDecode
rather than the full table. Why?
There are about 5.6 billion rows in
[fh-bigquery:wikipedia.pagecounts_201504]
. However, one of the query
predicates will filter the input data down to the rows where language is “fr” (French) - this
is about 262 million rows. If we ran the UDF over the entire table and did the language and
cédille filtering in a single WHERE clause, that would cause the JavaScript framework to
process over 21 times more rows than it would with the filtered subquery.
This equates to a lot of CPU cycles wasted doing unnecessary data conversion and marshalling.
If your input can easily be filtered down before invoking a UDF by using native SQL predicates, doing so will usually lead to a faster (and potentially cheaper) query.
Avoid persistent mutable state
You must not store and access mutable state across UDF execution for different rows. The following contrived example illustrates this error:
// myCode.js var numRows = 0; function dontDoThis(r, emit) { emit(rowCount: ++numRows); } // The query. SELECT max(rowCount) FROM dontDoThis(myTable);
This is a problem because BigQuery will shard your query across multiple nodes, each
of which has independent V8 instances and will therefore accumulate separate values for
numRows
.
Expand select
*
You cannot execute SELECT * FROM urlDecode(...)
at this
time; you must explicitly list the columns being selected from the UDF: select
requests, title from urlDecode(...)
For more information about BigQuery User-Defined Functions, see the full feature documentation.