Gemini 2.0: Level Up Your Apps with Real-Time Multimodal Interactions

DEC. 23, 2024

Ivan Solovyev Product Manager

Shrestha Basu Mallick Product Google DeepMind

Human-to-human communication is naturally multimodal, involving a mix of spoken words, visual cues, and real-time adjustments. With the Multimodal Live API for Gemini, we've achieved the same level of naturalness in human-computer interaction. Imagine AI conversations that feel more interactive, where you can use visual inputs and receive context-aware solutions in real time, seamlessly blending text, audio, and video. The Multimodal Live API for Gemini 2.0 enables this type of interaction and is available in Google AI Studio and the Gemini API. This technology allows you to build applications that respond to the world as it happens, leveraging real-time data.

How it works

The Multimodal Live API is a stateful API that uses WebSockets for low-latency, server-to-server communication. It supports tools such as function calling, code execution, and search grounding, and it can combine multiple tools within a single request, so the model can give a comprehensive response without multiple prompts. This allows developers to create more efficient and complex AI interactions.
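To make the session model concrete, here is a minimal sketch of the first message a client might send after opening the WebSocket. It declares the model, the response modality, a voice, and several tools in a single setup request. The field names, the model identifier, and the voice name follow our recollection of the public API reference and should be checked against the current documentation; `get_weather` is a hypothetical function invented for illustration.

```python
import json

# Assumed model identifier; check the current model list before use.
MODEL = "models/gemini-2.0-flash-exp"


def build_setup_message(voice_name="Puck", function_declarations=None):
    """Return the setup message sent once, right after the socket opens."""
    tools = [
        {"code_execution": {}},  # let the model run code
        {"google_search": {}},   # search grounding
    ]
    if function_declarations:
        tools.append({"function_declarations": function_declarations})
    return {
        "setup": {
            "model": MODEL,
            "generation_config": {
                "response_modalities": ["AUDIO"],
                "speech_config": {
                    "voice_config": {
                        "prebuilt_voice_config": {"voice_name": voice_name}
                    }
                },
            },
            "tools": tools,
        }
    }


# Hypothetical function the model may call; name and schema are illustrative.
get_weather = {
    "name": "get_weather",
    "description": "Return the current weather for a city.",
    "parameters": {
        "type": "OBJECT",
        "properties": {"city": {"type": "STRING"}},
        "required": ["city"],
    },
}

# Serialized payload, ready to be sent as the first WebSocket frame.
setup_json = json.dumps(build_setup_message(function_declarations=[get_weather]))
```

Because the API is stateful, this configuration is sent once per session; later messages carry only the new input for each turn.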

Key features of the Multimodal Live API include:

Bidirectional streaming: Allows for concurrent sending and receiving of text, audio and video data.

Sub-second latency: Outputs the first token in 600 milliseconds, in line with the response times people expect in conversation.

Natural voice conversations: Supports human-like voice interactions, including the ability to interrupt and features like voice activity detection, enabling more fluid dialogue with AI.

Video understanding: Provides the ability to process and understand video input, enabling the model to combine both audio and video contexts for a more informed and nuanced response. This contextual awareness brings another layer of richness to the interaction.

Tool integration: Facilitates the integration of multiple tools within a single API call, extending the API's capabilities and allowing it to perform actions on behalf of the user to solve complex tasks.

Steerable voices: Offers a selection of five distinct voices with a high level of expressiveness, capable of conveying a wide spectrum of emotions. This allows for a more personalized and engaging user experience.
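As an illustration of the client side of bidirectional streaming, the sketch below splits raw microphone audio into short chunks and wraps each one in a realtime-input message. The message shape (`realtime_input`, `media_chunks`), the 16 kHz 16-bit mono PCM format, and the 100 ms chunk length are assumptions based on our reading of the public documentation, not a definitive client. A real application would send each message over the open WebSocket while concurrently reading the model's responses.

```python
import base64

SAMPLE_RATE_HZ = 16_000   # assumed input sample rate
BYTES_PER_SAMPLE = 2      # 16-bit mono PCM
CHUNK_MS = 100            # illustrative chunk length


def audio_chunks_to_messages(pcm: bytes, chunk_ms: int = CHUNK_MS):
    """Yield one realtime-input message per chunk of raw PCM audio."""
    chunk_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(pcm), chunk_bytes):
        chunk = pcm[start:start + chunk_bytes]
        yield {
            "realtime_input": {
                "media_chunks": [
                    {
                        "mime_type": "audio/pcm",
                        # Binary audio is base64-encoded to fit in a JSON message.
                        "data": base64.b64encode(chunk).decode("ascii"),
                    }
                ]
            }
        }


# One second of silence, split into 100 ms messages.
one_second = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
messages = list(audio_chunks_to_messages(one_second))
```

Shorter chunks tend to reduce the delay before the model hears new speech, at the cost of more messages per second.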

Multimodal Live Streaming in Action

The Multimodal Live API enables a variety of real-time, interactive applications. Here are a few examples of use cases where this API can be effectively applied:

Real-Time Virtual Assistants: Imagine an assistant that observes your screen and offers tailored advice in real time, telling you where to find what you are looking for or executing actions on your behalf.

Adaptive Educational Tools: The API supports the development of educational applications that can adapt to a student's learning pace. For example, a language learning app could adjust the difficulty of exercises based on a student's real-time pronunciation and comprehension.
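An assistant that acts on a user's behalf typically does so through function calling: the model asks the client to run a function, and the client returns the result. The sketch below shows one way a client might route such requests to local handlers. The message keys (`toolCall`, `functionCalls`, `tool_response`, `function_responses`) reflect our understanding of the API's JSON messages and should be verified against the reference; `set_reminder` is a hypothetical action invented for this example.

```python
def set_reminder(text: str) -> dict:
    """Hypothetical local action; a real app would call its own backend here."""
    return {"status": "scheduled", "text": text}


# Maps function names declared at setup time to local Python callables.
HANDLERS = {"set_reminder": set_reminder}


def answer_tool_call(server_message: dict):
    """Build a tool-response message, or return None if no tool call is present."""
    tool_call = server_message.get("toolCall")
    if tool_call is None:
        return None
    responses = []
    for call in tool_call.get("functionCalls", []):
        handler = HANDLERS.get(call["name"])
        if handler is None:
            # Report unknown functions back to the model instead of raising.
            result = {"error": f"unknown function {call['name']}"}
        else:
            result = handler(**call.get("args", {}))
        # Echo the call id so the server can match the result to its request.
        responses.append(
            {"id": call["id"], "name": call["name"], "response": result}
        )
    return {"tool_response": {"function_responses": responses}}
```

Returning an error payload for unknown functions, rather than raising, keeps the session alive and lets the model recover in its next turn.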

To help you explore this new functionality, we've created several demo applications showcasing real-time streaming capabilities:

A starter web application for streaming mic, camera or screen input. A perfect base for your creativity:

Full code and a getting started guide are available on GitHub: https://github.com/google-gemini/multimodal-live-api-web-console

Chat with Gemini about the weather. Select a location and have a Gemini-powered character explain the weather there. You can interrupt and ask a follow-up question at any time.

Getting Started with the Multimodal Live API

Ready to dive in? Experiment with Multimodal Live Streaming directly in Google AI Studio for a hands-on experience. Or, for full control, grab the detailed documentation and code samples to start building with the API today.

We've also partnered with Daily to provide a seamless WebRTC SDK integration built with Pipecat, the open source framework. Pipecat's cross-platform support lets you add real-time capabilities to your web and native mobile apps. Daily, which provides SDKs and global infrastructure for ultra-low-latency voice, video, and AI, maintains Pipecat with contributions from the community. Check out Daily's integration guide to get started building.

We're excited to see your creations. Share your feedback and the amazing applications you build with the new API!
