Breaking the SQL Barrier: Google BigQuery User-Defined Functions

AUG 25, 2015

Posted by, Thomas Park, Senior Software Engineer, Google BigQuery

Many types of computations can be difficult or impossible to express in SQL. Loops, complex conditionals, and non-trivial string parsing or transformations are all common examples. What can you do when you need to perform these operations but your data lives in a SQL-based Big data tool? Is it possible to retain the convenience and speed of keeping your data in a single system, when portions of your logic are a poor fit for SQL?

Google BigQuery is a fully managed, petabyte-scale data analytics service that uses SQL as its query interface. As part of our latest BigQuery release, we are announcing support for executing user-defined functions (UDFs) over your BigQuery data. This gives you the ability to combine the convenience and accessibility of SQL with the option to use a familiar programming language, JavaScript, when SQL isn’t the right tool for the job.

How does it work?

BigQuery UDFs are similar to map functions in MapReduce. They take one row of input and produce zero or more rows of output, potentially with a different schema.

Below is a simple example that performs URL decoding. Although BigQuery provides a number of built-in functions, it does not have a built-in for decoding URL-encoded strings. However, this functionality is available in JavaScript, so we can extend BigQuery with a simple User-Defined Function to decode this type of data:

      function decodeHelper(s) {
      try {
      return decodeURI(s);
      } catch (ex) {
      return s;
      }
      }

      // The UDF.
      function urlDecode(r, emit) {
      emit({title: decodeHelper(r.title),
      requests: r.num_requests});
      }

      

BigQuery UDFs are functions with two formal parameters. The first parameter is a variable to which each input row will be bound. The second parameter is an “emitter” function. Each time the emitter is invoked with a JavaScript object, that object will be returned as a row to the query.

In the above example, urlDecode is the UDF that will be invoked from BigQuery. It calls a helper function decodeHelper that uses JavaScript’s built-in decodeURI function to transform URL-encoded data into UTF-8.

Note the use of try / catch in decodeHelper. Data is sometimes dirty! If we encounter an error decoding a particular string for any reason, the helper returns the original, un-decoded string.

To make this function visible to BigQuery, it is necessary to include a registration call in your code that describes the function, including its input columns and output schema, and a name that you’ll use to reference the function in your SQL:

      bigquery.defineFunction(
      'urlDecode', // Name used to call the function from SQL.

      ['title', 'num_requests'], // Input column names.

      // JSON representation of output schema.
      [{name: 'title', type: 'string'},
      {name: 'requests', type: 'integer'}],

      urlDecode // The UDF reference.
      );

      

The UDF can then be invoked by the name “urlDecode” in the SQL query, with a source table or subquery as an argument. The following query looks for the most-visited French Wikipedia articles from April 2015 that contain a cédille character (ç) in the title:

      SELECT requests, title
      FROM
      urlDecode(
      SELECT
      title, sum(requests) AS num_requests
      FROM
      [fh-bigquery:wikipedia.pagecounts_201504]
      WHERE language = 'fr'
      GROUP EACH BY title
      )
      WHERE title LIKE '%ç%'
      ORDER BY requests DESC
      LIMIT 100

      

This query processes data from a 5.6 billion row / 380 Gb dataset and generally runs in less than two minutes. The cost? About $1.37, at the time of this writing.

To use a UDF in a query, it must be described via UserDefinedFunctionResource elements in your JobConfiguration request. UserDefinedFunctionResource elements can either contain inline JavaScript code or pointers to code files stored in Google Cloud Storage.

Under the hood

JavaScript UDFs are executed on instances of Google V8 running on Google servers. Your code runs close to your data in order to minimize added latency.

You don’t have to worry about provisioning hardware or managing pipelines to deal with data import / export. BigQuery automatically scales with the size of the data being queried in order to provide good query performance.

In addition, you only pay for what you use - there is no need to forecast usage or pre-purchase resources.

Developing your function

Interested in developing your JavaScript UDF without running up your BigQuery bill? Here is a simple browser-based widget that allows you to test and debug UDFs.

Note that not all JavaScript functionality supported in the browser is available in BigQuery. For example, anything related to the browser DOM is unsupported, including Window and Document objects, and any functions that require them, such as atob() / btoa().

Tips and tricks

Pre-filter input

In our URL-decoding example, we are passing a subquery as the input to urlDecode rather than the full table. Why?

There are about 5.6 billion rows in [fh-bigquery:wikipedia.pagecounts_201504]. However, one of the query predicates will filter the input data down to the rows where language is “fr” (French) - this is about 262 million rows. If we ran the UDF over the entire table and did the language and cédille filtering in a single WHERE clause, that would cause the JavaScript framework to process over 21 times more rows than it would with the filtered subquery. This equates to a lot of CPU cycles wasted doing unnecessary data conversion and marshalling.

If your input can easily be filtered down before invoking a UDF by using native SQL predicates, doing so will usually lead to a faster (and potentially cheaper) query.

Avoid persistent mutable state

You must not store and access mutable state across UDF execution for different rows. The following contrived example illustrates this error:

      // myCode.js
      var numRows = 0;

      function dontDoThis(r, emit) {
      emit(rowCount: ++numRows);
      }

      // The query.
      SELECT max(rowCount) FROM dontDoThis(myTable);

      

This is a problem because BigQuery will shard your query across multiple nodes, each of which has independent V8 instances and will therefore accumulate separate values for numRows.

Expand select *

You cannot execute SELECT * FROM urlDecode(...) at this time; you must explicitly list the columns being selected from the UDF: select requests, title from urlDecode(...)

For more information about BigQuery User-Defined Functions, see the full feature documentation.