Gemini 2.0: Level Up Your Apps with Real-Time Multimodal Interactions

December 23, 2024
Ivan Solovyev, Product Manager

Human-to-human communication is naturally multimodal, involving a mix of spoken words, visual cues, and real-time adjustments. With the Multimodal Live API for Gemini, we've achieved this same level of naturalness in human-computer interaction. Imagine AI conversations that feel more interactive, where you can use visual inputs and receive context-aware solutions in real time, seamlessly blending text, audio, and video. The Multimodal Live API for Gemini 2.0 enables this type of interaction and is available in Google AI Studio and the Gemini API. This technology allows you to build applications that respond to the world as it happens, leveraging real-time data.


How it works

The Multimodal Live API is a stateful API that uses WebSockets for low-latency, server-to-server communication. It supports tools such as function calling, code execution, and search grounding, and it lets you combine multiple tools in a single request, so the model can deliver comprehensive responses without multiple prompts. This allows developers to create more efficient and complex AI interactions.
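
To make this concrete, here's a minimal sketch of a text-only live session using the google-genai Python SDK. The model ID and method names reflect the SDK at the time of writing and may change, so treat this as a starting point rather than a definitive reference:

# pip install google-genai
import asyncio
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")

# Ask for text responses; ["AUDIO"] is also supported.
config = {"response_modalities": ["TEXT"]}

async def main():
    # connect() opens a stateful WebSocket session: the model keeps
    # the conversation context for as long as the session stays open.
    async with client.aio.live.connect(
        model="gemini-2.0-flash-exp", config=config
    ) as session:
        # Send one user turn; end_of_turn asks the model to respond now.
        await session.send(input="What can you do in real time?", end_of_turn=True)

        # The response streams back incrementally over the same socket.
        async for response in session.receive():
            if response.text:
                print(response.text, end="")

asyncio.run(main())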

Key features of the Multimodal Live API include:

  • Bidirectional streaming: Allows for concurrent sending and receiving of text, audio, and video data.

  • Sub-second latency: Outputs the first token in about 600 milliseconds, aligning response times with human expectations for seamless conversation.

  • Natural voice conversations: Supports human-like voice interactions, including the ability to interrupt and features like voice activity detection, enabling more fluid dialogue with AI.

  • Video understanding: Provides the ability to process and understand video input, enabling the model to combine both audio and video contexts for a more informed and nuanced response. This contextual awareness brings another layer of richness to the interaction.

  • Tool integration: Facilitates the integration of multiple tools within a single API call, extending the API's capabilities and allowing it to perform actions on behalf of the user to solve complex tasks (see the sketch after this list).

  • Steerable voices: Offers a selection of five distinct voices with a high level of expressiveness, capable of conveying a wide spectrum of emotions. This allows for a more personalized and engaging user experience.
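
To illustrate the tool integration point above, here's a hedged sketch that combines built-in search grounding and code execution with a custom function in one session config. The set_thermostat function is hypothetical, used only for illustration, and the exact config keys may vary by SDK version:

# A hedged sketch: declaring several tools for a single live session.
# `set_thermostat` is a hypothetical function, for illustration only.
tools = [
    {"google_search": {}},   # search grounding
    {"code_execution": {}},  # server-side code execution
    {"function_declarations": [{
        "name": "set_thermostat",
        "description": "Set the target temperature in a room.",
        "parameters": {
            "type": "OBJECT",
            "properties": {
                "room": {"type": "STRING"},
                "celsius": {"type": "NUMBER"},
            },
            "required": ["room", "celsius"],
        },
    }]},
]

config = {"response_modalities": ["TEXT"], "tools": tools}
# Pass this config to client.aio.live.connect() as in the sketch above.
# When the model decides to call set_thermostat, the session emits a tool
# call message, and your code replies with a function response over the
# same connection.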


Multimodal live streaming in action

The Multimodal Live API enables a variety of real-time, interactive applications. Here are a few examples of use cases where this API can be effectively applied:

  • Real-Time Virtual Assistants: Imagine an assistant that observes your screen and offers tailored advice in real time, telling you where to find what you are looking for or executing actions on your behalf.

  • Adaptive Educational Tools: The API supports the development of educational applications that can adapt to a student's learning pace. For example, a language learning app could adjust the difficulty of exercises based on a student's real-time pronunciation and comprehension.

To help you explore this new functionality and kick-start your own projects, we've created several demo applications showcasing real-time streaming capabilities:

A starter web application for streaming mic, camera, or screen input, and a perfect base for your creativity.

Full code and a getting started guide are available on GitHub: https://github.com/google-gemini/multimodal-live-api-web-console.
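
If you'd like to experiment with the same idea outside the browser, here's a hedged Python sketch of capturing camera frames and streaming them into an open live session. It follows the pattern in Google's cookbook starter scripts, but treat the payload shape passed to send() as an assumption and verify it against the current SDK:

import asyncio
import base64
import io

import cv2        # pip install opencv-python
import PIL.Image  # pip install pillow

async def stream_camera(session, fps=1.0):
    """Send JPEG frames from the default camera into a live session."""
    cap = cv2.VideoCapture(0)
    try:
        while cap.isOpened():
            ok, frame = cap.read()
            if not ok:
                break
            # Downscale and encode each frame; the API expects inline
            # base64 media with an explicit MIME type.
            img = PIL.Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            img.thumbnail((1024, 1024))
            buf = io.BytesIO()
            img.save(buf, format="jpeg")
            await session.send(input={
                "mime_type": "image/jpeg",
                "data": base64.b64encode(buf.getvalue()).decode(),
            })
            await asyncio.sleep(1.0 / fps)
    finally:
        cap.release()

Run this alongside a receive loop like the one in the first sketch, so the model's commentary streams back while frames go up.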


Chat with Gemini about the weather. Select a location and have a Gemini-powered character explain the weather in that location. You can interrupt and ask a follow-up question at any time.


Getting Started with the Multimodal Live API

Ready to dive in? Experiment with Multimodal Live Streaming directly in Google AI Studio for a hands-on experience. Or, for full control, grab the detailed documentation and code samples to start building with the API today.

We've also partnered with Daily to provide a seamless integration via their pipecat framework, enabling you to add real-time capabilities to your apps effortlessly. Daily is a video and audio API platform that makes it easy for developers to add real-time video and audio streaming to their websites and apps. Check out Daily's integration guide to get started building.

We're excited to see your creations. Share your feedback and the amazing applications you build with the new API!