Gemma 3 1B is a new model size in the Gemma family of open weight models that truly opens the possibility for distributing in-app small language models (SLMs) across mobile and web. When deploying SLMs in production settings, models need to be small enough to download quickly, run fast enough to hold user attention, and support a wide range of end user devices.
At only 529MB, Gemma 3 1B runs at up to 2585 tok/sec on prefill via Google AI Edge's LLM Inference API, which means it can process a page of content in under a second. By including Gemma 3 1B in your app, you can use natural language to drive your application or generate content from in-app data or context, all fully customizable and fine-tunable.
In this post, we'll walk through some example use cases for Gemma 3 in your application, show how to get started with Gemma on Android, dive into some of the performance metrics, and explain how all of this was achieved.
With a fully on-device Gemma 3 1B model, you are able to take advantage of the benefits of AI Edge:
1. Cost: With no cloud bills, enable free or freemium apps.
2. Latency: Some features need to be faster than a server call allows.
3. Privacy: Bring intelligence to data that cannot leave the device or is end-to-end encrypted.
Gemma 3 1B is extremely versatile and can even be fine-tuned for your own domain and use cases. Here are just a few of our favorite use cases for Gemma 3 1B:
1. In-Game Dialog: Create NPC dialog based on the current game state.
2. Smart Reply: Provide users with intelligent, conversation-aware suggested responses while messaging (a prompt sketch follows this list).
3. Document Q&A: Use Gemma 3 along with our new AI Edge RAG SDK to ingest long documents and answer user questions.
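To make the Smart Reply idea concrete, here is a minimal Kotlin sketch of how you might turn recent chat messages into a prompt for an on-device model. The message type, the prompt wording, and the buildSmartReplyPrompt helper are illustrative assumptions for this post, not part of the Gemma 3 or AI Edge APIs.

// Hypothetical helper: format the latest messages into a prompt that asks for
// three short reply suggestions. Prompt wording is illustrative only.
data class ChatMessage(val sender: String, val text: String)

fun buildSmartReplyPrompt(history: List<ChatMessage>): String {
    val transcript = history.takeLast(6)
        .joinToString("\n") { "${it.sender}: ${it.text}" }
    return """
        Given the conversation below, suggest three short replies the user could
        send next. Keep each reply under 10 words.

        $transcript

        Suggested replies:
    """.trimIndent()
}

The resulting prompt string can then be passed to the LLM Inference API call shown later in this post.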
Step 1: Load the Demo app
Download Google AI Edge’s pre-built demo app from GitHub and push it to your local Android device. For best performance with Gemma 3 1B, we recommend a device with at least 4GB of memory.
$ wget https://github.com/google-ai-edge/mediapipe-samples/releases/download/v0.1.3/llm_inference_v0.1.3-debug.apk
$ adb install llm_inference_v0.1.3-debug.apk
Alternatively, you can follow our instructions to build the app from source.
The Gemma 3 model file offers great deployment flexibility, running seamlessly on either your device's CPU or mobile GPU. You can choose to run Gemma 3 on CPU or GPU when you first start the app, or switch between models and backends by going back to the model selection dialog.
On the model selection screen in the demo app, choose your model. The app will direct you to Hugging Face to log in and accept the Gemma terms of use. Gemma 3 1B, quantized to int4, will be downloaded directly from the LiteRT Hugging Face community organization and then optimized once to run on your device (this only takes a few seconds!).
Now it's time to put Gemma 3 to work! Under the hood, Gemma 3 is powered by Google AI Edge’s LLM Inference API, designed for efficient on-device processing.
You can interact with the model by chatting with it, or give it other text processing tasks, such as summarizing a passage or rewriting a draft message.
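If you want to call the model from your own Android app rather than the demo, the snippet below is a minimal sketch built on the publicly documented MediaPipe LLM Inference API (com.google.mediapipe.tasks.genai.llminference). Treat the model path and generation parameters as placeholders; option names can differ slightly between SDK releases, so check them against the version you depend on.

import android.content.Context
import com.google.mediapipe.tasks.genai.llminference.LlmInference

fun summarizeOnDevice(context: Context, text: String): String {
    // Configure the task with the on-device Gemma 3 1B model file.
    // The path is a placeholder; point it at wherever your app stores the model.
    // Recent SDK releases also expose an option to prefer the CPU or GPU backend;
    // check the LLM Inference docs for the exact name in your version.
    val options = LlmInference.LlmInferenceOptions.builder()
        .setModelPath("/data/local/tmp/llm/gemma3-1b-it-int4.task")
        .setMaxTokens(1024)   // combined prompt + response token budget
        .setTopK(40)
        .setTemperature(0.8f)
        .setRandomSeed(0)
        .build()

    // Creating the engine triggers the one-time on-device optimization pass
    // mentioned above, so call this off the UI thread.
    val llmInference = LlmInference.createFromOptions(context, options)

    // Synchronous call; generateResponseAsync() streams partial results instead.
    return llmInference.generateResponse("Summarize the following text:\n$text")
}

In a real app you would add the MediaPipe tasks-genai Gradle dependency and keep the LlmInference instance around rather than rebuilding it on every call.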
One of the great things about the Gemma family of open weight models are the fine-tuned versions produced by the modeling community. Follow this Colab to see how you can use your own data to create your own version of Gemma 3 1B, quantize it, and get it running on mobile devices (CPU and GPU) in your own applications!
The demo and measurements here are for the Gemma 3 1B model with int4 parameters, quantized via quantization-aware training (QAT), which provides significant storage savings and increased decode throughput. The benchmarked Gemma 3 model supports multiple prefill lengths of 32, 128, 512, and 1024 tokens and uses a context length of 2048 tokens.
The performance results described above were achieved through extensive optimization efforts. These optimizations were designed to work well across open weight models, including Gemma. Here are some key features that significantly boosted performance and enabled new, reusable functionality.
Quantization: Quantization-aware training was applied to Gemma using a 4-bit integer, channel-wise scheme on the weights to preserve model quality while reducing size and improving performance. In addition to weight quantization, we also dynamically quantize the activations to int8 during execution to make the best use of the CPU.
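As a rough illustration of what channel-wise int4 weight quantization means, the sketch below computes one scale per output channel and maps that channel's weights into a 4-bit integer range. It is a simplified stand-in, not the actual QAT scheme or kernel code used in the release.

import kotlin.math.abs
import kotlin.math.roundToInt

// Simplified symmetric, channel-wise int4 quantization: one scale per channel,
// weights mapped into [-7, 7] so they fit in 4 bits.
fun quantizeChannelInt4(channelWeights: FloatArray): Pair<ByteArray, Float> {
    val scale = (channelWeights.maxOf { abs(it) } / 7f).coerceAtLeast(1e-8f)
    val quantized = ByteArray(channelWeights.size) { i ->
        (channelWeights[i] / scale).roundToInt().coerceIn(-7, 7).toByte()
    }
    return quantized to scale
}

// At inference time the weight is recovered as quantized * scale; activations are
// quantized to int8 on the fly with scales computed from the current tensors.
fun dequantizeChannel(quantized: ByteArray, scale: Float): FloatArray =
    FloatArray(quantized.size) { i -> quantized[i] * scale }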
Updating the KV cache layout: The KV cache is used in Transformer-based models to store the key-value pairs from previous steps so they can be used to generate subsequent tokens. Reads and writes to the KV cache happen frequently, so it is important that these operations are efficient. They were optimized by introducing a new KV cache layout that reduces extra transposes and reshapes. This improved latency on Gemma models by approximately 25% on CPU and 20% on GPU. An extra operation was also added to update the KV cache in place on the GPU more efficiently.
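The sketch below shows the basic idea behind an in-place KV cache update: keys and values for each new token are written into preallocated buffers in a layout chosen up front, so decoding never reallocates, transposes, or reshapes the cache. The layout and class here are illustrative, not the LiteRT implementation.

// Illustrative KV cache: one contiguous buffer per tensor, laid out as
// [position][head][headDim] and written in place as each token is decoded.
class KvCache(maxSeqLen: Int, private val numHeads: Int, private val headDim: Int) {
    private val keys = FloatArray(maxSeqLen * numHeads * headDim)
    private val values = FloatArray(maxSeqLen * numHeads * headDim)
    var length = 0
        private set

    // Append the key/value vectors for one new token without copying any
    // previously cached entries.
    fun append(stepKeys: FloatArray, stepValues: FloatArray) {
        val offset = length * numHeads * headDim
        stepKeys.copyInto(keys, destinationOffset = offset)
        stepValues.copyInto(values, destinationOffset = offset)
        length++
    }
}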
Improved Loading Time: To make the most of CPU and GPU processing, we use specialized tensor layouts. Generating these optimized weight layouts can take significant time, power, and memory, so during the first model load the weights are cached on disk in their optimized format, and subsequent loads read from the cache. If the tensor layouts are further optimized, the existing cache is automatically invalidated and the new format is stored on disk during the next model load.
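Conceptually, this follows a familiar "transform once, cache on disk" pattern, sketched below. The cache file naming, the version tag, and the transform function are placeholders; the real runtime handles all of this internally.

import java.io.File

// Hypothetical sketch of caching expensive-to-generate optimized weight layouts.
// A version tag in the file name invalidates the cache when the layout changes.
fun loadOptimizedWeights(
    modelFile: File,
    cacheDir: File,
    layoutVersion: Int,
    transform: (ByteArray) -> ByteArray,
): ByteArray {
    val cacheFile = File(cacheDir, "${modelFile.name}.layout-v$layoutVersion.cache")
    if (cacheFile.exists()) {
        return cacheFile.readBytes()                      // fast path: reuse cached layout
    }
    val optimized = transform(modelFile.readBytes())      // slow path: first load only
    cacheFile.writeBytes(optimized)
    return optimized
}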
GPU Weight Sharing: The LLM inference process has two phases: prefill and decode. These phases typically use separate resources for their respective models. To dramatically reduce the memory footprint of LLMs, both phases can share the same weights. While this technique isn't entirely new, this is the first time it has been done in an easily reusable way in the LiteRT Runtime and GPU Delegate. For ops that support this feature, the GPU delegate checks if the weights are already present in GPU memory and can be reused. In the future, other models will be able to trivially take advantage of this capability.
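To illustrate the weight-sharing idea, here is a tiny sketch of a registry that hands the decode phase the weight buffer already uploaded for prefill instead of uploading a second copy; the types and names are made up for the example.

// Hypothetical registry: prefill and decode graphs request weights by id, and the
// second request reuses the buffer already resident in GPU memory.
class GpuWeightRegistry(private val upload: (FloatArray) -> Long) {
    private val resident = mutableMapOf<String, Long>()   // weight id -> GPU buffer handle

    fun getOrUpload(weightId: String, hostWeights: FloatArray): Long =
        resident.getOrPut(weightId) { upload(hostWeights) }
}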
During the development of Gemma 3, we focused on delivering excellent performance while also building reusable infrastructure for open weight models. In 2025, we plan to leverage this work to support a wider set of third-party models. With additional performance optimizations and an emphasis on further reducing memory use, we intend to continue making models more accessible on a wider range of devices. To keep up with the latest developments, set up notifications for ai_edge_torch on GitHub. More to come soon!
Advait Jain, Akshat Sharma, Alan Kelly, Andrei Kulik, Byungchul Kim, Chunlei Niu, Chun-nien Chan, Chuo-Ling Chang, Claudio Basile, Cormac Brick, Ekaterina Ignasheva, Eric Yang, Fengwu Yao, Frank Ban, Gerardo Carranza, Grant Jensen, Haoliang Zhang, Henry Wang, Ho Ko, Jae Yoo, Jiuqiang Tang, Juhyun Lee, Jun Jiang, Khanh LeViet, Kris Tonthat, Lin Chen, Lu Wang, Malini P V, Marissa Ikonomidis, Mark Sherwood, Matthew Soulanille, Matthias Grundmann, Mogan Shieh, Mohammadreza Heydary, Na Li, Pauline Sho, Pedro Gonnet, Ping Yu, Pulkit Bhuwalka, Quentin Khan, Ram Iyengar, Raman Sarokin, Rishika Sinha, Rishubh Khurana, Ronghui Zhu, Sachin Kotwani, Sebastian Schmidt, Steven Toribio, Suleman Shahid, T.J. Alumbaugh, Tenghui Zhu, Terry (Woncheol) Heo, Tyler Mullen, Vamsi Manchala, Vitalii Dziuba, Wai Hon Law, Weiyi Wang, Xu Chen, Yishuang Pang, Youchuan Hu, Yu-hui Chen, Zichuan Wei