Date: 2025-03-12

Gemma 3 1B is a new model size in the Gemma family of open weight models that truly opens the possibility for distributing in-app small language models (SLMs) across mobile and web. When deploying SLMs in production settings, models need to be small enough to download quickly, run fast enough to hold user attention, and support a wide range of end user devices. At only 529MB in size, Gemma 3 1B runs at up to 2585 tok/sec on prefill via Google AI Edge’s LLM inference, creating the ability to process a page of content in under a second. By including Gemma 3 1B in your app, you can use natural language to drive your application or generate content from in-app data or context, all fully customizable and fine-tunable. In this post, we'll walk through some example use cases for Gemma 3 in your application, show how to get started with Gemma on Android, dive into some of the performance metrics, and explain how all of this was achieved.
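To make the "page of content in under a second" figure concrete, here is a back-of-the-envelope sketch. The 2585 tok/sec peak prefill rate is the number quoted above; the 1,000-token page size and the function name `prefill_seconds` are assumptions for illustration, and real latency will include overheads this ignores.

```python
# Peak prefill throughput quoted in this post (tokens per second).
PREFILL_TOKENS_PER_SEC = 2585


def prefill_seconds(num_tokens: int,
                    tokens_per_sec: float = PREFILL_TOKENS_PER_SEC) -> float:
    """Estimate the time to ingest a prompt, ignoring fixed overheads."""
    if num_tokens < 0 or tokens_per_sec <= 0:
        raise ValueError("num_tokens must be >= 0 and tokens_per_sec > 0")
    return num_tokens / tokens_per_sec


# Assumption: a "page" of text is roughly 1,000 tokens.
page_tokens = 1000
print(f"{prefill_seconds(page_tokens):.2f} s")  # prints "0.39 s"
```

Under these assumptions the prompt is ingested in well under a second; decode (generation) time is separate and not covered by this estimate.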

What Can I Do With Gemma 3 in My App?

With a fully on-device Gemma 3 1B model, you are able to take advantage of the benefits of AI Edge:

1. Offline Availability: Enable your app to work fully when WiFi or cellular data is unavailable.
2. Cost: With no cloud bills, enable free or freemium apps.
3. Latency: Some features need to be faster than a server call allows.
4. Privacy: Bring intelligence to data that is unable to leave the device or is end-to-end encrypted.

Gemma 3 1B is extremely versatile and can even be fine-tuned for your own domain and use cases. Here are just a few of our favorite use cases for Gemma 3 1B:

1. Data Captioning: Turn your app data into engaging and shareable descriptions, e.g., Sleep Data -> “You slept well for 7 hours but you stirred awake 5 times between 2am and 4am”.
2. In-Game Dialog: Create NPC dialog based on the current game state.
3. Smart Reply: Provide users with intelligent conversation-aware suggested responses while messaging.
4. Document Q&A: Use Gemma 3 along with our new AI Edge RAG SDK to ingest long documents and answer user questions.
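As a rough illustration of the data captioning use case, the sketch below turns structured sleep data into a text prompt. The function name, field names, and prompt wording are all hypothetical and not part of any Gemma or AI Edge API; the resulting string is what you would hand to whichever inference call your app uses.

```python
def build_caption_prompt(hours_slept: float, wake_count: int,
                         wake_window: str) -> str:
    """Format structured sleep data as a captioning prompt (hypothetical schema)."""
    facts = (
        f"hours_slept={hours_slept:g}, "
        f"wake_count={wake_count}, "
        f"wake_window={wake_window}"
    )
    return (
        "Write one friendly sentence summarizing this sleep data for the user.\n"
        f"Data: {facts}\n"
        "Summary:"
    )


# Example values mirroring the sleep caption above.
prompt = build_caption_prompt(7, 5, "2am-4am")
print(prompt)
```

Keeping the data in a short, regular `key=value` form is one possible design choice; a fine-tuned model could be trained on whatever format your app already produces.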

Getting started

Step 1: Load the Demo app

Download Google AI Edge’s pre-built demo app from GitHub and push it to your local Android device. For best performance with Gemma 3 1B, we recommend a device with at least 4GB of memory.

$ wget https://github.com/google-ai-edge/mediapipe-samples/releases/download/v0.1.3/llm_inference_v0.1.3-debug.apk
$ adb install llm_inference_v0.1.3-debug.apk

Alternatively, you can follow our instructions to build the app from source.

Step 2: Select CPU or GPU

The Gemma 3 model file offers great deployment flexibility, running seamlessly on either your device's CPU or mobile GPU. You can choose to run Gemma 3 on CPU or GPU when you first start the app, or switch between models and backends by going back to the model selection dialog.

Step 3: Download the Model from Hugging Face

On the model selection screen in the demo app, choose your model. The app will direct you to Hugging Face to log in and accept the Gemma terms of use. Gemma 3 1B, quantized at int4, will be downloaded directly from the LiteRT Hugging Face community organization, and will then be optimized once to run on your device (but this only takes a few seconds!).

Step 4: Run the Model

Now it's time to put Gemma 3 to work! Under the hood, Gemma 3 is powered by Google AI Edge’s LLM Inference API, designed for efficient on-device processing. You can interact with the model by chatting with it, or you can give it other text processing tasks. For example, try the following:

1. Copy a few paragraphs from a blog post (like this one) or an article.
2. Switch over to the LLM Demo app.
3. Paste the copied text into the input box.
4. Type "Create a social media post for this content. Keep it short and sweet. Less than 50 words" and press enter.

Step 5: Customize Gemma 3 (optional)

One of the great things about the Gemma family of open weight models is the set of fine-tuned versions produced by the modeling community. Follow this Colab to see how you can use your own data to create your own version of Gemma 3 1B, quantize it, and get it running on mobile devices (CPU and GPU) in your own applications!

Performance

The demo and measurements here are for the Gemma 3 1B model with int4 parameters quantized via quantization-aware training (QAT), which provides significant storage savings and increased decode throughput. The benchmarked Gemma 3 model supports multiple prefill lengths of 32, 128, 512, and 1024, and it uses a context length of 2048.
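The storage savings from int4 can be sanity-checked with simple arithmetic. The sketch below assumes a nominal parameter count of exactly one billion (an assumption based only on the "1B" name) and counts raw weight bytes only; the function name is hypothetical, and a packaged model file will differ because of its true parameter count, metadata, and any tensors kept at higher precision.

```python
def weight_storage_mb(num_params: int, bits_per_param: int) -> float:
    """Raw weight storage in megabytes (1 MB = 10**6 bytes)."""
    return num_params * bits_per_param / 8 / 1e6


NOMINAL_PARAMS = 1_000_000_000  # assumption: "1B" taken literally

for name, bits in [("fp32", 32), ("fp16", 16), ("int4", 4)]:
    print(f"{name}: {weight_storage_mb(NOMINAL_PARAMS, bits):.0f} MB")
# fp32: 4000 MB
# fp16: 2000 MB
# int4: 500 MB
```

The int4 estimate of roughly 500 MB is in the same range as the 529MB model size quoted earlier in this post, and about a quarter of what 16-bit weights would need.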

Measurements were taken on an Android Samsung Galaxy S24 Ultra with cpufreq governor set to performance. Observed performance may vary depending on your phone’s hardware and current activity level.

Measurements were taken on a MacBook Pro 2023 (Apple M3 Pro chip). Observed performance may vary depending on your computer’s hardware and current activity level.