Gemini 1.5 Flash-8B is now production ready

OCT. 3, 2024

Logan Kilpatrick Group Product Manager

Shrestha Basu Mallick Product Google DeepMind

Today, Gemini 1.5 Flash-8B, our latest Flash variant, is production-ready and comes with:

50% lower price (compared to 1.5 Flash)

2x higher rate limits (compared to 1.5 Flash)

Lower latency on small prompts (compared to 1.5 Flash)

Developers can access gemini-1.5-flash-8b for free via Google AI Studio and the Gemini API.
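As a minimal sketch of what a call to the Gemini API with this model might look like, the Python below builds a `generateContent` request using only the standard library. The `v1beta` REST path, the `x-goog-api-key` header, the response shape, and the `GEMINI_API_KEY` environment variable name are assumptions made for illustration, not details from this post, so check the current API reference before relying on them.

```python
import json
import os
import urllib.request

# Assumed REST base path for the Gemini API (not stated in this post).
API_BASE = "https://generativelanguage.googleapis.com/v1beta"
MODEL = "gemini-1.5-flash-8b"  # model name given in the post


def build_generate_request(prompt: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a generateContent request for a text prompt."""
    url = f"{API_BASE}/models/{MODEL}:generateContent"
    body = {"contents": [{"parts": [{"text": prompt}]}]}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-goog-api-key": api_key},
        method="POST",
    )


def generate_text(prompt: str, api_key: str) -> str:
    """Send the request and return the first candidate's text (assumed shape)."""
    with urllib.request.urlopen(build_generate_request(prompt, api_key)) as resp:
        payload = json.load(resp)
    return payload["candidates"][0]["content"]["parts"][0]["text"]


if __name__ == "__main__":
    # GEMINI_API_KEY is a hypothetical variable name; nothing is sent without it.
    key = os.environ.get("GEMINI_API_KEY")
    if key:
        print(generate_text("Summarize Gemini 1.5 Flash-8B in one sentence.", key))
```

Keeping request construction separate from sending makes the payload easy to inspect or log before any network traffic happens.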

Our lightweight model, smaller and faster

At I/O, we announced Gemini 1.5 Flash, our lightweight model, optimized for speed and efficiency. Over the last few months, Google DeepMind has made considerable progress making 1.5 Flash even better based on developer feedback and testing the limits of what’s possible.

Last month, we released an experimental version of Gemini 1.5 Flash-8B, a smaller and faster variant of 1.5 Flash. We’re now excited to make it generally available for production use. Flash-8B nearly matches the performance of the 1.5 Flash model launched in May across many benchmarks. It performs especially well on tasks such as chat, transcription, and long-context language translation.

Our release of best-in-class small models continues to be informed by developer feedback and our own testing of what is possible with these models. We see the most potential for this model in tasks ranging from high-volume multimodal use cases to long-context summarization.

Performance chart of the 1.5 Flash model launched in May across many benchmarks

Lowest cost per intelligence of any Gemini model

With the stable release of Gemini 1.5 Flash-8B, we are announcing the lowest cost per intelligence of any Gemini model:

$0.0375 per 1 million input tokens on prompts <128K

$0.15 per 1 million output tokens on prompts <128K

$0.01 per 1 million tokens on cached prompts <128K
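To make the per-token prices above concrete, here is a rough cost estimator in Python. It is a sketch under stated assumptions: the function name is hypothetical, cached tokens are assumed to be counted separately from regular input tokens, and it covers only prompts under 128K tokens, ignoring any other charges not listed in this post.

```python
# Prices quoted in the post, in USD per 1 million tokens (prompts < 128K).
INPUT_PRICE_PER_M = 0.0375
OUTPUT_PRICE_PER_M = 0.15
CACHED_PRICE_PER_M = 0.01
TOKENS_PER_MILLION = 1_000_000


def estimate_cost_usd(input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    """Estimate the cost of a workload from its total token counts.

    Assumption: cached tokens are billed at the cached rate and are not
    also included in input_tokens.
    """
    if min(input_tokens, output_tokens, cached_tokens) < 0:
        raise ValueError("token counts must be non-negative")
    total = (
        input_tokens * INPUT_PRICE_PER_M
        + output_tokens * OUTPUT_PRICE_PER_M
        + cached_tokens * CACHED_PRICE_PER_M
    )
    return total / TOKENS_PER_MILLION


if __name__ == "__main__":
    # Example: 10,000 requests, each with 2,000 input and 500 output tokens.
    requests = 10_000
    cost = estimate_cost_usd(requests * 2_000, requests * 500)
    print(f"Estimated cost: ${cost:.2f}")
```

In this example the workload totals 20 million input tokens and 5 million output tokens, which comes to $0.75 for input plus $0.75 for output.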

For developers on the paid tier, billing will start on Monday, October 14.

This new price, along with the work we have already done to drive down developer costs with 1.5 Flash and 1.5 Pro, highlights our commitment to ensuring developers have the freedom to build the products and services that push the world forward.

A pricing table for the Gemini 1.5 Flash model, outlining the cost per one million tokens for input and output

2x higher rate limits for Flash-8B

Gemini 1.5 Flash-8B is best suited for simple, higher-volume tasks. To make this model as useful as we can, we are doubling the 1.5 Flash-8B rate limits, meaning developers can send up to 4,000 requests per minute (RPM).
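For high-volume workloads, a client can pace itself to stay under the 4,000 RPM figure rather than wait for rejected requests. The sliding-window limiter below is one possible sketch. The class name and its parameters are hypothetical, and the limit that actually applies to a given account may differ from the default used here.

```python
import time
from collections import deque


class RequestsPerMinuteLimiter:
    """Client-side sliding-window limiter (illustrative, not an official client)."""

    def __init__(self, max_requests=4000, window_seconds=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self._max = max_requests
        self._window = window_seconds
        self._clock = clock      # injectable so the limiter can be tested
        self._sleep = sleep
        self._sent = deque()     # timestamps of requests inside the window

    def wait_time(self) -> float:
        """Return how many seconds to wait before the next request is allowed."""
        now = self._clock()
        # Drop timestamps that have aged out of the window.
        while self._sent and now - self._sent[0] >= self._window:
            self._sent.popleft()
        if len(self._sent) < self._max:
            return 0.0
        return self._window - (now - self._sent[0])

    def acquire(self) -> None:
        """Block until a request slot is free, then record the request."""
        delay = self.wait_time()
        while delay > 0:
            self._sleep(delay)
            delay = self.wait_time()
        self._sent.append(self._clock())


if __name__ == "__main__":
    limiter = RequestsPerMinuteLimiter()
    for _ in range(3):
        limiter.acquire()  # call this right before each API request
    print("3 requests admitted")
```

A sliding window is used here because it never admits more than the configured number of requests in any 60-second span. The service still enforces its own limits, so callers should also handle rate-limit errors.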

Happy building and stay tuned for more updates!
