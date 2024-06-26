While generative AI (gen AI) is rapidly growing in adoption, there is still largely untapped potential to build products by applying gen AI to data that carries stricter privacy and confidentiality requirements. For example, this could mean applying gen AI to:

- Data processing that enables personal assistants that are more fully integrated into, and aware of, what’s happening in our lives, and thus able to help us in a broader range of daily circumstances.
- Confidential business information, e.g., to automate tedious tasks such as processing invoices or handling customer support queries, improving productivity and lowering operational cost.

In applications such as these, there may be heightened requirements with respect to privacy/confidentiality, transparency, and external verifiability of data processing. Google has developed a number of technologies that you can use to start exploring the potential of gen AI to process data that needs to stay more private. In this post, we’ll explain how you can use the recently released GenC open-source project to combine Confidential Computing, the Gemma open-source models, and mobile platforms to begin experimenting with building your own gen AI-powered apps that can handle data subject to such requirements.

End-user devices and the cloud, working together

The scenario we’ll focus on in this post, illustrated below, involves a mobile app that has access to data on device, and wants to perform gen AI processing on this data using an LLM. For example, imagine a personal assistant app that’s being asked to summarize or answer a question about notes, a document, or a recording stored on the device. The content might contain private information such as messages with another person, so we want to ensure it stays private. In our example, we picked the Gemma family of open-source models. Note that whereas we focus here on a mobile app, the same principles apply to businesses hosting their own data on-premises.

This example shows a “hybrid” setup that involves two LLMs: one running locally on the user’s device, and another hosted in a Google Cloud Confidential Space Trusted Execution Environment (TEE) powered by Confidential Computing. This hybrid architecture lets the mobile app draw on both on-device and cloud resources and benefit from the distinct advantages of each:

- A smaller instance of quantized Gemma 2B, which comes in a ~1.5GB package and fits on modern mobile devices (such as Pixel 7). It can provide faster response times (without incurring network or data transfer latency), support queries even without a network connection, and improve cost-efficiency by using local on-device hardware resources (and thus reach a broader audience for the same cost on the cloud side).
- A larger instance of unquantized Gemma 7B, which comes in at just under 35GB and doesn’t fit even on high-powered devices. Since it’s hosted in the cloud, it depends on a network connection and comes at a higher cost, but it offers better quality and the ability to handle more complex or expensive queries (with more resources available for processing), in addition to other benefits (such as reducing the mobile device’s battery consumption by offloading computation to the cloud).

In our example, the two models work together, connected into a model cascade in which the smaller, cheaper, and faster Gemma 2B serves as the first tier and handles simpler queries, whereas the larger Gemma 7B serves as a backup for queries that the former can’t handle on its own. For example, in the code snippet further below, we set up Gemma 2B to act as an on-device router that first analyzes each input query to decide which of the two models is most appropriate, and then, based on the outcome, either handles the query locally on-device or relays it to the Gemma 7B that resides in a cloud-based TEE.
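The cascade pattern described above can be sketched in plain Python, independent of GenC. This is only an illustration under stated assumptions: `make_cascade`, `small_model`, `large_model`, and `needs_large_model` are hypothetical names, the two models are stand-in functions rather than real Gemma instances, and the routing rule is a toy heuristic.

```python
from typing import Callable

Model = Callable[[str], str]


def make_cascade(
    small_model: Model,
    large_model: Model,
    needs_large_model: Callable[[str], bool],
) -> Model:
    """Builds a two-tier cascade: a router picks which model answers each query."""

    def cascade(query: str) -> str:
        # The router runs first (on-device in the article's setup) ...
        if needs_large_model(query):
            # ... and escalates harder queries to the larger model.
            return large_model(query)
        # Simpler queries never leave the first tier.
        return small_model(query)

    return cascade


# Stand-in models (hypothetical): real ones would call Gemma 2B / Gemma 7B.
def small_model(query: str) -> str:
    return f"[small] {query}"


def large_model(query: str) -> str:
    return f"[large] {query}"


# Toy routing rule (hypothetical): treat long queries as complex.
def needs_large_model(query: str) -> bool:
    return len(query.split()) > 8


cascade = make_cascade(small_model, large_model, needs_large_model)
print(cascade("Summarize my note"))
print(cascade("Compare the three contracts I saved last week and list every conflicting clause"))
```

In the article's setup the routing decision is itself made by the small model rather than by a fixed rule, but the control flow has the same shape: route first, then answer with exactly one of the two tiers.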

TEE as a logical extension of the device

You can think of the TEE in the cloud in this architecture as effectively a logical extension of the user’s mobile device, powered by transparency, cryptographic guarantees, and trusted hardware:

- The private container with Gemma 7B and the GenC runtime hosted in the TEE runs with encrypted memory, the communication between the device and the TEE is encrypted as well, and no data is persisted (if it needed to be, it could also be encrypted at rest).
- Before any interaction takes place, the device verifies the identity and integrity of the code in the TEE that will handle queries delegated from the device by requesting an attestation report, which includes a SHA256 digest of the container image that runs in the TEE. The device compares this digest against a digest bundled with the app by the developer. (Note that in this simple scenario, the user still trusts the app developer, just as they would with a purely on-device app; more complex setups are possible, but beyond the scope of this article.)
- All code that runs in the container image in this scenario is 100% open-source. Thus, the developer or any other external party can independently inspect the code that goes into the image to verify that it handles the data in a manner that matches user or data owner expectations, regulatory or contractual obligations, etc. They can then build the image on their own and confirm that the resulting image digest matches the digest bundled with the app, which the app expects to see in the attestation report subsequently returned by the TEE.

At a glance this setup might seem complex, and indeed it would be if one had to set it all up from scratch. We’ve developed GenC precisely to make the process easier.
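The digest comparison described above can be illustrated with Python's standard library. This is a minimal sketch, not GenC's or Confidential Space's actual verification code: the report is modeled as a plain dict with a hypothetical `image_digest` key, `sha256_digest` and `digest_matches` are hypothetical helpers, and the digest is computed over stand-in bytes rather than a real container image. A real attestation report is also signed, and its signature would need to be verified before trusting any field in it; that step is omitted here.

```python
import hashlib
import hmac


def sha256_digest(data: bytes) -> str:
    """Returns a 'sha256:<hex>' digest string for the given bytes."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def digest_matches(report: dict, expected_digest: str) -> bool:
    """Checks the digest claimed in the report against the one bundled with the app."""
    reported = report.get("image_digest", "")
    # compare_digest does a constant-time comparison of the two values.
    return hmac.compare_digest(reported.encode(), expected_digest.encode())


# Stand-in bytes (hypothetical); a real digest identifies a container image.
image_bytes = b"stand-in for container image contents"
expected = sha256_digest(image_bytes)  # what the developer bundles with the app

good_report = {"image_digest": sha256_digest(image_bytes)}
bad_report = {"image_digest": sha256_digest(b"tampered image")}

print(digest_matches(good_report, expected))
print(digest_matches(bad_report, expected))
```

The app would proceed to send queries only when the check succeeds; a missing or mismatched digest is treated as a failure.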

Simplifying the developer experience

Here’s an example of the code you would actually have to write to set up a scenario like the above in GenC. We default here to Python as a popular choice, although we offer Java and C++ authoring APIs as well. In this example, we use the presence of a more sensitive subject as a signal that the query should be handled by a more powerful model (one that is capable of crafting a more careful response). Keep in mind that this example is simplified for illustration purposes. In practice, routing logic could be more elaborate and application-specific, and careful prompt engineering is essential to achieving good performance, especially with smaller models.

@genc.authoring.traced_computation
def cascade(x):
  gemma_2b_on_device = genc.interop.llamacpp.model_inference(
      '/device/llamacpp', '/gemma-2b-it.gguf', num_threads=16, max_tokens=64)

  gemma_7b_in_a_tee = genc.authoring.confidential_computation[
      genc.interop.llamacpp.model_inference(
          '/device/llamacpp', '/gemma-7b-it.gguf', num_threads=64, max_tokens=64),
      {'server_address': /* server address */,
       'image_digest': /* image digest */}]

  router = genc.authoring.serial_chain[
      genc.authoring.prompt_template[
          """Read the following input carefully: "{x}". Does it touch on political topics?"""],
      gemma_2b_on_device,
      genc.authoring.regex_partial_match['does touch|touches']]

  return genc.authoring.conditional[
      gemma_2b_on_device(x), gemma_7b_in_a_tee(x)](router(x))
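To get a feel for how the router's pattern behaves on model output, the regex step can be approximated with Python's standard `re` module. This sketch assumes that `regex_partial_match` behaves like an unanchored search (`re.search`), which the snippet above does not confirm; `partial_match` is a hypothetical helper and the sample replies are invented for illustration.

```python
import re

# The same pattern the router above passes to regex_partial_match.
ROUTER_PATTERN = re.compile(r"does touch|touches")


def partial_match(text: str) -> bool:
    """Returns True if the pattern occurs anywhere in the text (assumed semantics)."""
    return ROUTER_PATTERN.search(text) is not None


# Invented model replies, for illustration only.
samples = [
    "Yes, the input touches on political topics.",
    "It does touch on politics.",
    "No, it does not touch on political topics.",
    "No, it never touches on politics.",
]

for reply in samples:
    print(partial_match(reply), "-", reply)
```

The last sample matches even though the reply is negative, which is one concrete reason the prompt and the pattern need to be designed together, especially with smaller models.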