Enabling more private generative AI

JULY 11, 2024

Krzysztof Ostrowski Research Scientist

Rene Kolga Senior Product Manager

While generative AI (gen AI) is rapidly growing in adoption, there is still largely untapped potential to build products by applying gen AI to data that has stricter privacy and confidentiality requirements.

For example, this could mean applying gen AI to:

Data processing that enables personal assistants that are more fully integrated into and aware of what’s happening in our lives, and thus able to help us in a broader range of daily circumstances.

Confidential business information, e.g., to automate tedious tasks such as processing invoices or handling customer support queries to improve productivity and lower the operational cost.

In certain applications such as these, there may be heightened requirements with respect to privacy/confidentiality, transparency, and external verifiability of data processing.

Google has developed a number of technologies that you can use to start exploring the potential of gen AI to process data that needs to stay more private. In this post, we’ll explain how you can use the recently released GenC open-source project to combine Confidential Computing, the Gemma open-source models, and mobile platforms to begin experimenting with building your own gen AI-powered apps that can handle data with heightened privacy/confidentiality, transparency, and external verifiability requirements.

End-user devices and the cloud, working together

The scenario we’ll focus on in this post, illustrated below, involves a mobile app that has access to data on device, and wants to perform gen AI processing on this data using an LLM.

For example, imagine a personal assistant app that’s being asked to summarize or answer a question about notes, a document, or a recording stored on the device. The content might contain private information such as messages with another person, so we want to ensure it stays private.

In our example, we picked the Gemma family of open-source models. Note that whereas we focus here on a mobile app, the same principles apply to businesses hosting their own data on-premises.

A model cascade that spans from a mobile device to a trusted enclave in cloud.

This example shows a “hybrid” setup that involves two LLMs: one running locally on the user’s device, and another hosted in a Trusted Execution Environment (TEE) in Google Cloud’s Confidential Space, powered by Confidential Computing. This hybrid architecture enables the mobile app to combine on-device and cloud resources and benefit from the unique advantages of each:

A smaller instance of quantized Gemma 2B that comes in a ~1.5GB package and fits on modern mobile devices (such as Pixel 7), where it can provide faster response times (without incurring network or data transfer latency), the ability to support queries even without a network connection, and better cost-efficiency thanks to using the local on-device hardware resources (and thus reaching a broader audience for the same cost on the cloud side).

A larger instance of unquantized Gemma 7B that comes in at just under 35GB and doesn’t fit even on high-powered devices. Since it’s hosted in the cloud, it depends on a network connection and comes at a higher cost, but it offers better quality and the ability to handle more complex or expensive queries (with more resources available for processing), in addition to other benefits (such as reducing the mobile device’s battery consumption by offloading calculations to the cloud).

In our example, the two models work together, connected into a model cascade in which the smaller, cheaper, and faster Gemma 2B serves as the first tier and handles simpler queries, whereas the larger Gemma 7B serves as a backup for queries that the former can’t handle on its own. For example, in the code snippet further below, we set up Gemma 2B to act as an on-device router that first analyzes each input query to decide which of the two models is most appropriate, and then, based on the outcome, either handles the query locally on-device or relays it to the Gemma 7B that resides in a cloud-based TEE.
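The cascade described above can be sketched in plain Python, independently of GenC. This is a minimal illustration only: the two model functions are hypothetical stubs, and the toy router uses query length as a stand-in for the model-based routing decision described in this post.

```python
from typing import Callable


# Hypothetical stand-ins for the two models; a real app would call actual
# inference backends here.
def small_model_on_device(query: str) -> str:
    return f"[on-device Gemma 2B] {query}"


def large_model_in_tee(query: str) -> str:
    return f"[Gemma 7B in TEE] {query}"


def needs_large_model(query: str) -> bool:
    """Toy router (assumption): treats queries over 12 words as complex.

    In the post, this decision is made by the on-device model itself.
    """
    return len(query.split()) > 12


def cascade(query: str,
            router: Callable[[str], bool] = needs_large_model) -> str:
    # The first tier decides; the second tier is contacted only when needed.
    if router(query):
        return large_model_in_tee(query)
    return small_model_on_device(query)


print(cascade("What time is it?"))
```

Keeping the router as a separate, swappable function mirrors the idea that the routing criterion is application-specific and can evolve without touching the rest of the cascade.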

TEE as a logical extension of the device

You can think of the TEE in the cloud in this architecture as effectively a logical extension of the user’s mobile device, powered by transparency, cryptographic guarantees, and trusted hardware:

The private container with Gemma 7B and the GenC runtime hosted in the TEE runs with encrypted memory, the communication between the device and the TEE is encrypted as well, and no data is persisted (if persistence were needed, the data could also be encrypted at rest).

Before any interaction takes place, the device verifies the identity and integrity of the code in the TEE that will handle queries delegated from the device by requesting an attestation report, which includes a SHA256 digest of the container image that runs in the TEE. The device compares this digest against a digest bundled with the app by the developer. (Note that in this simple scenario, the user still trusts the app developer, just as they would with a purely on-device app; more complex setups are possible, but beyond the scope of this article.)

All code that runs in the container image in this scenario is 100% open-source. Thus, either the developer, or any other external party can independently inspect the code that goes into the image to verify that it handles the data in a manner that matches user or data owner expectations, regulatory or contractual obligations, etc., and then proceed to build the image on their own, and to confirm that the resulting image digest matches the digest bundled within the app and expected by the app in the attestation report that’s subsequently returned by the TEE.
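The digest comparison at the heart of this check can be sketched as follows. This is a simplified illustration, not GenC’s implementation: the `image_digest` key and the shape of the claims dictionary are assumptions made for the example, the expected digest is a placeholder derived from dummy bytes, and a real client would first have to validate the attestation report’s signature before trusting any of its contents.

```python
import hashlib
import hmac

# Placeholder for the digest the developer bundles with the app. A real
# container image digest is derived from the image itself, not dummy bytes.
EXPECTED_DIGEST = "sha256:" + hashlib.sha256(b"example image").hexdigest()


def digest_matches(attestation_claims: dict, expected_digest: str) -> bool:
    """Return True only if the reported image digest equals the bundled one.

    Assumes the claims were already extracted from a report whose signature
    has been verified; that step is omitted here.
    """
    reported = attestation_claims.get("image_digest")  # hypothetical key
    if not isinstance(reported, str):
        return False
    # Hex digests are case-insensitive; compare in constant time.
    return hmac.compare_digest(
        reported.lower().encode(), expected_digest.lower().encode())


claims = {"image_digest": EXPECTED_DIGEST}
print(digest_matches(claims, EXPECTED_DIGEST))
```

If the check fails, the app should refuse to send any query to the backend, since it can no longer tell what code would process the data.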

At first glance this setup might seem complex, and indeed it would be if you had to build it all from scratch. We’ve developed GenC precisely to make the process easier.

Simplifying the developer experience

Here’s an example of the code you would actually have to write to set up a scenario like the above in GenC. We default here to Python as a popular choice, although we offer Java and C++ authoring APIs as well. In this example, we use the presence of a more sensitive subject as a signal that the query should be handled by a more powerful model (one that is capable of crafting a more careful response). Keep in mind this example is simplified for illustration purposes. In practice, routing logic could be more elaborate and application-specific, and careful prompt engineering is essential to achieving good performance, especially with smaller models.

import genc

@genc.authoring.traced_computation
def cascade(x):
  # Smaller quantized model, running locally on the device via llama.cpp.
  gemma_2b_on_device = genc.interop.llamacpp.model_inference(
    '/device/llamacpp', '/gemma-2b-it.gguf', num_threads=16, max_tokens=64)

  # Larger model, wrapped so that it runs in the TEE identified by the
  # server address and expected container image digest (fill both in).
  gemma_7b_in_a_tee = genc.authoring.confidential_computation[
    genc.interop.llamacpp.model_inference(
      '/device/llamacpp', '/gemma-7b-it.gguf', num_threads=64, max_tokens=64),
    {'server_address': ...,  # server address
     'image_digest': ...}]  # image digest

  # On-device router: prompt template, then Gemma 2B, then a regex check
  # on the model's answer.
  router = genc.authoring.serial_chain[
    genc.authoring.prompt_template[
      """Read the following input carefully: "{x}".
      Does it touch on political topics?"""],
    gemma_2b_on_device,
    genc.authoring.regex_partial_match['does touch|touches']]

  # Select between the two models based on the router's result.
  return genc.authoring.conditional[
    gemma_2b_on_device(x), gemma_7b_in_a_tee(x)](router(x))


You can find a detailed step-by-step breakdown of how to build and run such examples in our tutorials on GitHub. As you can see, the level of abstraction matches what you can find in popular SDKs such as LangChain. Model inference calls to Gemma 2B and 7B are interspersed here with prompt templates and output parsers, and combined into chains. (By the way, we do offer limited LangChain interop that we hope to expand.)
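To make the chain idea concrete, here is a plain-Python sketch of how a prompt template, a model call, and an output parser compose into a serial chain. It is not GenC code: the helper names merely echo the operators used in the snippet above, their behavior here is an assumption made for illustration, and `fake_model` is a hypothetical stub standing in for a real inference call.

```python
import re
from functools import reduce
from typing import Any, Callable

Step = Callable[[Any], Any]


def prompt_template(template: str) -> Step:
    # Fills the "{x}" slot of the template with the chain input.
    return lambda x: template.format(x=x)


def regex_partial_match(pattern: str) -> Step:
    # True if the pattern occurs anywhere in the text.
    compiled = re.compile(pattern)
    return lambda text: compiled.search(text) is not None


def serial_chain(*steps: Step) -> Step:
    # Feeds the output of each step into the next one, left to right.
    return lambda x: reduce(lambda value, step: step(value), steps, x)


def fake_model(prompt: str) -> str:
    # Hypothetical stub; a real chain would call an LLM here.
    if "election" in prompt.lower():
        return "Yes, it touches on that."
    return "No, it does not."


router = serial_chain(
    prompt_template(
        'Read the following input carefully: "{x}". '
        "Does it touch on political topics?"),
    fake_model,
    regex_partial_match("does touch|touches"),
)

print(router("Who will win the election?"))
```

Because each step is just a function from input to output, steps can be tested in isolation and swapped without changing the chain’s structure.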

Note that whereas the Gemma 2B model inference call is used directly within a chain that runs on-device, the Gemma 7B call is explicitly embedded within a confidential_computation statement.

The point is that there are no surprises here - the programmer is always in full control of the decision of what processing to perform on-device, and what to delegate from the device to a TEE in the cloud. This decision is explicitly reflected in the code structure. (Note that whereas in this example we only delegate the Gemma 7B calls to a single trusted backend, the mechanism we provide is generic, and one can use it to delegate larger chunks of processing, e.g., an entire agent loop, to an arbitrary number of backends.)

From prototyping to flexible deployment

Whereas the code shown above is expressed using familiar Python syntax, under the hood it’s transformed into a portable, platform- and language-independent form that we refer to as the Intermediate Representation (or “IR” for short).

This approach offers a number of advantages; to name a few:

It enables you to prototype and test your gen AI logic in an easy-to-use rapid development environment that supports fast-paced iteration, such as a Jupyter notebook, and then deploy the same gen AI code with minimal to no changes to run, e.g., in a Java app on a mobile device. In our tutorials, this is as simple as copying a file containing the IR to your mobile device and loading it in your app.

It enables you to deploy and run the same logic, with consistent behavior across languages and platforms (e.g., from Linux-based to mobile platforms, from Python to Java and C++). This is a win if you plan to target a number of different product surfaces.

It enables you to dynamically delegate any part of the gen AI logic across process and machine boundaries. This is implicitly what’s happening in our scenario, with the mobile device delegating to a TEE in the cloud. It just so happens that in this simple example, we’re only delegating a single operation (the Gemma 7B inference call). The mechanism we offer is considerably more general.
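The general idea behind these advantages can be illustrated with a toy example. The sketch below is not GenC’s actual IR or runtime: the JSON encoding, the operator names, and the `run` interpreter are all assumptions invented for this illustration. It only shows why logic expressed as data can be authored in one place, shipped as a file, and executed by a different runtime that supplies its own backends.

```python
import json


def build_ir() -> dict:
    # Authoring side: the logic is described as data rather than code.
    return {
        "op": "conditional",
        "condition": {"op": "contains", "needle": "urgent"},
        "then": {"op": "call", "target": "large_model"},
        "else": {"op": "call", "target": "small_model"},
    }


def run(ir: dict, x: str, backends: dict):
    # Runtime side: a tiny interpreter that any platform could reimplement.
    op = ir["op"]
    if op == "call":
        return backends[ir["target"]](x)
    if op == "contains":
        return ir["needle"] in x
    if op == "conditional":
        branch = "then" if run(ir["condition"], x, backends) else "else"
        return run(ir[branch], x, backends)
    raise ValueError(f"unknown op: {op}")


# Serialize (e.g., to copy to a device), then load on the "other side".
serialized = json.dumps(build_ir())
loaded = json.loads(serialized)

# Each runtime binds the named targets to its own implementations; one of
# them could just as well be a proxy that forwards to a remote backend.
backends = {
    "small_model": lambda q: "small:" + q,
    "large_model": lambda q: "large:" + q,
}

print(run(loaded, "urgent: reply now", backends))
```

In this toy version, delegation across machine boundaries would amount to sending a sub-tree of the data structure to another runtime and evaluating it there.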

In realistic deployments, performance is often a critical factor. Our published examples are currently CPU-only, and GenC currently offers only llama.cpp as the driver for models in a TEE. However, the Confidential Computing team is extending support to Intel TDX with the built-in Intel AMX accelerator, along with the upcoming preview of NVIDIA H100 GPUs running in confidential mode, and we are actively working to expand the range of available software and hardware options to unlock the best performance and support a broader range of models - stay tuned for future updates!

We’d love to hear from you!

We hope that you’re intrigued, and that this post will encourage you to experiment with building your own gen AI applications using some of the technologies we’ve introduced. On that note, please keep in mind that GenC is an experimental framework, developed for experimental and research purposes - we’ve built it to demonstrate what’s possible, and to inspire you to explore this exciting space together with us. If you’d like to contribute, please reach out to the authors, or simply engage with us on GitHub. We love to collaborate!