On-device small language models with multimodality, RAG, and Function Calling

MAY 20, 2025
Mark Sherwood, Senior Product Manager
Matthew Chan, Staff Software Engineer
Marissa Ikonomidis, Staff Software Engineer

Last year Google AI Edge introduced support for on-device small language models (SLMs) with four initial models on Android, iOS, and Web. Today, we are excited to expand support to over a dozen models, including the new Gemma 3 and Gemma 3n models, hosted in our new LiteRT Hugging Face Community.

Gemma 3n, available via Google AI Edge as an early preview, is Gemma’s first multimodal on-device small language model supporting text, image, video, and audio inputs. Paired with our new Retrieval Augmented Generation (RAG) and Function Calling libraries, you have everything you need to prototype and build transformative AI features fully on the edge.

Let users control apps with on-device SLMs and our new function calling library

Broader model support

You can find our growing list of models to choose from in the LiteRT Hugging Face Community. Download any of these models and easily run them on-device with just a few lines of code. The models are fully optimized and converted for mobile and web. Full instructions on how to run these models can be found in our documentation and on each model card on Hugging Face.
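
As an illustration of what those few lines look like, here is a minimal Kotlin sketch of running a downloaded model on Android with the MediaPipe LLM Inference API; the model path is a placeholder for wherever your app stores its downloaded model.

```kotlin
import com.google.mediapipe.tasks.genai.llminference.LlmInference

// Point the engine at a model downloaded from the LiteRT Hugging Face Community.
// The path below is a placeholder; load from wherever your app stores the model.
val options = LlmInference.LlmInferenceOptions.builder()
    .setModelPath("/data/local/tmp/llm/gemma3-1b-it-int4.task")
    .setMaxTokens(512)
    .build()

// `context` is your Android Context (e.g. an Activity or Application).
val llmInference = LlmInference.createFromOptions(context, options)
val response = llmInference.generateResponse("Summarize this paragraph: ...")
```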

To customize any of these models, fine-tune the base model, then convert and quantize it using the appropriate AI Edge libraries. We have a Colab showing every step you need to fine-tune and then convert Gemma 3 1B.

With the latest release of our quantization tools, we have new quantization schemes that allow for much higher-quality int4 post-training quantization. Compared to bf16, the default data type for many models, int4 quantization can reduce the size of language models by a factor of 2.5-4x while significantly decreasing latency and peak memory consumption. Since an int4 weight takes a quarter of the bits of a bf16 weight, fully quantized weights shrink by up to 4x; the realized savings depend on which tensors are quantized.


Gemma 3 1B & Gemma 3n

Earlier this year, we introduced Gemma 3 1B. At only 529MB, this model can run at up to 2,585 tokens per second of prefill on mobile GPUs, allowing it to process up to a page of content in under a second. Gemma 3 1B’s small footprint lets it support a wide range of devices and limits the size of the files an end user would need to download in their application.

Today, we are thrilled to add an early preview of Gemma 3n to our collection of supported models. The 2B and 4B parameter variants will both support native text, image, video, and audio inputs. The text and image modalities are available on Hugging Face, with audio to follow shortly.

Gemma 3n analyzing images fully on-device

Gemma 3n is great for enterprise use cases where developers have the full resources of the device available to them, allowing for larger models on mobile. Field technicians with no service could snap a photo of a part and ask a question. Workers in a warehouse or a kitchen could update inventory via voice while their hands are full.
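
To sketch what multimodal input looks like in code, the snippet below passes an image alongside a text prompt using the MediaPipe LLM Inference session API with vision modality enabled. Treat it as illustrative while Gemma 3n is in early preview; the model path and `photoBitmap` are placeholders.

```kotlin
import com.google.mediapipe.framework.image.BitmapImageBuilder
import com.google.mediapipe.tasks.genai.llminference.GraphOptions
import com.google.mediapipe.tasks.genai.llminference.LlmInference
import com.google.mediapipe.tasks.genai.llminference.LlmInferenceSession

// Allow one image per query when loading the model (placeholder path).
val options = LlmInference.LlmInferenceOptions.builder()
    .setModelPath("/data/local/tmp/llm/gemma-3n-e2b-it-int4.task")
    .setMaxNumImages(1)
    .build()
val llmInference = LlmInference.createFromOptions(context, options)

// Enable the vision modality for this session, then interleave text and image.
val sessionOptions = LlmInferenceSession.LlmInferenceSessionOptions.builder()
    .setGraphOptions(GraphOptions.builder().setEnableVisionModality(true).build())
    .build()
val session = LlmInferenceSession.createFromOptions(llmInference, sessionOptions)
session.addQueryChunk("What part is shown in this photo, and how do I reseat it?")
session.addImage(BitmapImageBuilder(photoBitmap).build()) // photoBitmap: android.graphics.Bitmap
val answer = session.generateResponse()
```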


Bringing context to conversations: On-device Retrieval Augmented Generation (RAG)

One of the most exciting new capabilities we're bringing to Google AI Edge is robust support for on-device Retrieval Augmented Generation (RAG). RAG lets you augment your small language model with data specific to your application, without the need for fine-tuning. Given 1,000 pages of information or 1,000 photos, RAG can surface just the few most relevant pieces of data to feed to your model.

The AI Edge RAG library works with any of our supported small language models. Furthermore, it offers the flexibility to swap out any part of the RAG pipeline, enabling custom databases, chunking methods, and retrieval functions. The AI Edge RAG library is available today on Android, with more platforms to follow. This means your on-device generative AI applications can now be grounded in specific, user-relevant information, unlocking a new class of intelligent features.
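
As a sketch of how the pieces fit together, the snippet below wires an on-device embedder, a SQLite-backed vector store, and a language model into a retrieval-and-inference chain. The class names follow the RAG library's Android sample, but treat exact signatures as assumptions and check the documentation; `languageModel` stands in for the library's wrapper around your LLM Inference engine, and `PROMPT_TEMPLATE`, `documentChunks`, and the model paths are placeholders.

```kotlin
// Package paths follow the AI Edge RAG library's Android sample.
import com.google.ai.edge.localagents.rag.chains.ChainConfig
import com.google.ai.edge.localagents.rag.chains.RetrievalAndInferenceChain
import com.google.ai.edge.localagents.rag.memory.DefaultSemanticTextMemory
import com.google.ai.edge.localagents.rag.memory.SqliteVectorStore
import com.google.ai.edge.localagents.rag.models.GeckoEmbeddingModel
import com.google.ai.edge.localagents.rag.prompt.PromptBuilder
import com.google.common.collect.ImmutableList
import java.util.Optional

// An on-device Gecko embedder and a SQLite vector store hold your app's data.
val embedder = GeckoEmbeddingModel(
    "/data/local/tmp/gecko_256_quant.tflite",           // placeholder paths
    Optional.of("/data/local/tmp/sentencepiece.model"),
    /* useGpu= */ true,
)
val memory = DefaultSemanticTextMemory(SqliteVectorStore(/* embeddingDim= */ 768), embedder)

// Chain = embed the query, retrieve relevant chunks, template a prompt, run the LLM.
val config = ChainConfig.create(languageModel, PromptBuilder(PROMPT_TEMPLATE), memory)
val chain = RetrievalAndInferenceChain(config)

// Index your app-specific text once; subsequent queries are grounded in it.
memory.recordBatchedMemoryItems(ImmutableList.copyOf(documentChunks))
```

At query time you build a retrieval request from the user's question and invoke the chain, which retrieves the most relevant chunks and returns a response grounded in them.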


Enabling action: On-device function calling

To make on-device language models truly interactive, we're introducing on-device function calling. The AI Edge Function Calling library is available on Android today with more platforms to follow. The library includes all of the utilities you need to integrate with an on-device language model, register your application functions, parse the response, and call your functions. Check out the documentation to try it yourself.
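
At a high level, you describe each app function to the model with a declaration, hand the declarations to a model wrapper, and send the user's request. The sketch below follows the Function Calling SDK's builder-style API, but treat the exact types as assumptions and check the documentation; `llmInference` is an LLM Inference engine you have already created, and `setPatientName` is a hypothetical app function.

```kotlin
// Package paths follow the AI Edge Function Calling SDK's Android docs.
import com.google.ai.edge.localagents.core.proto.Content
import com.google.ai.edge.localagents.core.proto.FunctionDeclaration
import com.google.ai.edge.localagents.core.proto.Part
import com.google.ai.edge.localagents.core.proto.Schema
import com.google.ai.edge.localagents.core.proto.Tool
import com.google.ai.edge.localagents.core.proto.Type
import com.google.ai.edge.localagents.fc.GemmaFormatter
import com.google.ai.edge.localagents.fc.GenerativeModel
import com.google.ai.edge.localagents.fc.LlmInferenceBackend

// Describe an app function so the model knows when and how to call it.
val setPatientName = FunctionDeclaration.newBuilder()
    .setName("setPatientName")
    .setDescription("Fills the patient name field of the intake form.")
    .setParameters(
        Schema.newBuilder()
            .setType(Type.OBJECT)
            .putProperties("name", Schema.newBuilder()
                .setType(Type.STRING)
                .setDescription("The patient's full name.")
                .build())
            .build())
    .build()
val tool = Tool.newBuilder().addFunctionDeclarations(setPatientName).build()

// Wrap an existing MediaPipe LlmInference engine with a model-specific formatter.
val backend = LlmInferenceBackend(llmInference, GemmaFormatter())
val systemInstruction = Content.newBuilder()
    .setRole("system")
    .addParts(Part.newBuilder().setText("Fill out the form from the user's dictation."))
    .build()
val generativeModel = GenerativeModel(backend, systemInstruction, listOf(tool))

val chat = generativeModel.startChat()
val response = chat.sendMessage("Hi, I'm Jane Doe, here for my appointment.")
```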

This powerful feature enables your language models to intelligently decide when to call predefined functions or APIs within your application. For example, in our sample app, we demonstrate how function calling can be used to fill out a form through natural language. In the context of a medical app asking for pre-appointment patient history, the user dictates their personal information. With our function calling library and an on-device language model, the app converts the speech to text, extracts the relevant information, and then calls application-specific functions to fill out the individual fields.
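
Continuing the hedged sketch above, the model's response can then be checked for function calls and routed to your own form-filling code; `fillNameField` is a placeholder for whatever your app does with the extracted value.

```kotlin
// Inspect the response: the model either answered in text or asked to call a function.
val content = response.getCandidates(0).content
for (part in content.partsList) {
    if (part.hasFunctionCall()) {
        val call = part.functionCall
        when (call.name) {
            // The argument struct carries the values the model extracted.
            "setPatientName" -> fillNameField(call.args.fieldsMap["name"]?.stringValue)
        }
    }
}
```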

The function calling library can also be paired with our Python tool simulation library. The tool simulation library helps you create a custom language model for your specific functions through synthetic data generation and evaluation, increasing the accuracy of on-device function calling.


What’s next

We will continue to support the latest and greatest small language models on the edge, including new modalities. Keep an eye on our LiteRT Hugging Face Community for new model releases. Our RAG and function calling libraries will continue to expand in functionality and supported platforms.

For more Google AI Edge news, read about the new LiteRT APIs and our new AI Edge Portal service for broad-coverage on-device benchmarking and evals.

Explore this announcement and all Google I/O 2025 updates on io.google starting May 22.


Acknowledgements

We want to thank the following Googlers for their support of these launches: Advait Jain, Akshat Sharma, Alan Kelly, Andrei Kulik, Byungchul Kim, Chunlei Niu, Chun-nien Chan, Chuo-Ling Chang, Claudio Basile, Cormac Brick, Ekaterina Ignasheva, Eric Yang, Fengwu Yao, Frank Ban, Gerardo Carranza, Grant Jensen, Haoliang Zhang, Henry Wang, Ho Ko, Ivan Grishchenko, Jae Yoo, Jingjiang Li, Jiuqiang Tang, Juhyun Lee, Jun Jiang, Kris Tonthat, Lin Chen, Lu Wang, Marissa Ikonomidis, Matthew Soulanille, Matthias Grundmann, Milen Ferev, Mogan Shieh, Mohammadreza Heydary, Na Li, Pauline Sho, Pedro Gonnet, Ping Yu, Pulkit Bhuwalka, Quentin Khan, Ram Iyengar, Raman Sarokin, Rishika Sinha, Ronghui Zhu, Sachin Kotwani, Sebastian Schmidt, Steven Toribio, Suleman Shahid, T.J. Alumbaugh, Tenghui Zhu, Terry (Woncheol) Heo, Tyler Mullen, Vitalii Dziuba, Wai Hon Law, Weiyi Wang, Xu Chen, Yi-Chun Kuo, Yishuang Pang, Youchuan Hu, Yu-hui Chen, Zichuan Wei