LiteRT: The Universal Framework for On-Device AI

JAN. 28, 2026
Lu Wang Software Engineer
Chintan Parikh Product Manager
Jingjiang Li Software Engineer
Terry Heo Software Engineer

Since we first introduced LiteRT in 2024, we have focused on evolving our ML tech stack from its TensorFlow Lite (TFLite) foundation into a modern on-device AI framework. While TFLite set the standard for classical ML, our mission is to empower developers to deploy today’s cutting-edge AI on-device just as seamlessly as they integrated classical ML in the past.

At Google I/O ‘25, we shared a preview of this evolution: a high-performance runtime designed specifically for advanced hardware acceleration. Today, we are excited to announce that these advanced acceleration capabilities have fully graduated into the LiteRT production stack, available now for all developers.

This milestone solidifies LiteRT as the universal on-device inference framework for the AI era and represents a significant leap over TFLite:

  • Faster: delivers 1.4x faster GPU performance than TFLite, and introduces new, state-of-the-art NPU acceleration.
  • Simpler: provides a unified, streamlined workflow for GPU and NPU acceleration across edge platforms.
  • Powerful: supports superior cross-platform GenAI deployment for popular open models like Gemma.
  • Flexible: offers first-class PyTorch/JAX support via seamless model conversion.

All of this is delivered while maintaining the same reliable, cross-platform deployment you have trusted since TFLite.

Here is how LiteRT empowers you to build the next generation of on-device AI.

High-performance cross-platform GPU acceleration

Moving beyond the initial GPU acceleration on Android announced at I/O ‘25, we are excited to introduce comprehensive GPU support across Android, iOS, macOS, Windows, Linux, and the web. This expansion gives developers a reliable, high-performance acceleration option that scales well beyond classical CPU inference.

[Figure: GPU platform support]

LiteRT maximizes reach by introducing robust support for OpenCL, OpenGL, Metal, and WebGPU via ML Drift, our next-generation GPU engine, allowing you to deploy models efficiently across mobile, desktop, and web. On Android, LiteRT goes further by automatically prioritizing OpenCL when available for peak performance and falling back to OpenGL for broader device coverage.

Powered by ML Drift, LiteRT GPU delivers substantial performance gains, averaging 1.4x faster than the legacy TFLite GPU delegate and significantly reducing latency across a broad range of models. See more benchmark results in our previous announcement.

To enable high-performance AI applications, we have also introduced key technical advancements to optimize end-to-end latency, specifically asynchronous execution and zero-copy buffer interoperability. These features significantly reduce unnecessary CPU overhead and boost overall performance, fulfilling the stringent requirements for real-time use cases like background segmentation and speech recognition (ASR). In practice, these optimizations can result in up to 2x faster performance, as demonstrated in our Segmentation sample app. For a closer look at the improvements, see our technical deep dive.

The following example demonstrates how easily you can leverage GPU acceleration with the new CompiledModel API in C++:

// 1. Create a compiled model targeting the GPU.
auto compiled_model = CompiledModel::Create(env, "mymodel.tflite",
                                            kLiteRtHwAcceleratorGpu);

// 2. Create an input TensorBuffer that wraps the OpenGL buffer (e.g. from
// image pre-processing) with zero-copy.
auto input_buffer = TensorBuffer::CreateFromGlBuffer(env, tensor_type,
                                                     opengl_buffer);
std::vector<TensorBuffer> input_buffers{input_buffer};
auto output_buffers = compiled_model.CreateOutputBuffers();

// 3. Execute the model.
compiled_model.Run(input_buffers, output_buffers);

// 4. Access the model output, e.g. as an AHardwareBuffer.
auto ahwb = output_buffers[0].GetAhwb();
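
Beyond the synchronous Run call above, the CompiledModel API is designed to support the asynchronous execution described earlier, so the CPU is not blocked while the accelerator works. The sketch below is illustrative only: the RunAsync method name and its completion semantics are assumptions, and PrepareNextFrame is a hypothetical application function; consult the CompiledModel documentation for the exact asynchronous API.

// Illustrative sketch of asynchronous execution; RunAsync and its
// event-based completion semantics are assumptions, not verified API.
// Dispatch the model without blocking the calling thread.
compiled_model.RunAsync(input_buffers, output_buffers);

// The CPU is free to do other work (e.g. pre-process the next camera
// frame) while the GPU executes the model.
PrepareNextFrame();  // hypothetical application function

// Downstream consumers synchronize on the events attached to the output
// TensorBuffers instead of stalling the CPU.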

See more instructions on LiteRT cross-platform development and GPU acceleration on the LiteRT DevSite.

Streamlined NPU integration with peak performance

While CPUs and GPUs offer broad versatility for AI tasks, the NPU is the key to unlocking the smooth, responsive, high-speed AI experiences that modern applications demand. However, fragmentation across hundreds of NPU SoC variants often forces developers to navigate a maze of disparate compilers and runtimes. And because traditional ML infrastructure has historically lacked deep integration with specialized NPU SDKs, the result has been complex, ad-hoc deployment workflows that are difficult to manage in production.

LiteRT addresses these challenges by providing a unified, simplified NPU deployment workflow that abstracts away low-level, vendor-specific SDKs and handles fragmentation across numerous SoC variants. We have streamlined this into a simple three-step process to get your models running with NPU acceleration:

  1. AOT Compilation for the target SoCs (optional): Use the LiteRT Python library to pre-compile your .tflite model for target SoCs.
  2. Deploy with Google Play for On-device AI (PODAI) if on Android: Leverage PODAI to automatically deliver the model and runtime to a compatible device.
  3. Inference using the LiteRT runtime: LiteRT handles NPU delegation and provides robust fallback to GPU or CPU if needed (see the sketch after this list).
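
To make step 3 concrete, here is a minimal sketch of NPU inference using the same CompiledModel API shown in the GPU example above, with error handling omitted in the same way. The kLiteRtHwAcceleratorNpu flag and the CreateInputBuffers/Write/Read buffer helpers are assumptions based on current LiteRT C++ headers, and input_data/output_data are hypothetical application arrays; check the NPU documentation for the exact signatures.

// 1. Create a compiled model targeting the NPU. If no compatible NPU is
// available, LiteRT falls back to GPU or CPU as described above.
auto compiled_model = CompiledModel::Create(env, "mymodel.tflite",
                                            kLiteRtHwAcceleratorNpu);

// 2. Let the runtime allocate appropriately-sized input/output buffers.
auto input_buffers = compiled_model.CreateInputBuffers();
auto output_buffers = compiled_model.CreateOutputBuffers();

// 3. Fill the input, run inference, and read back the result.
input_buffers[0].Write<float>(absl::MakeConstSpan(input_data, input_size));
compiled_model.Run(input_buffers, output_buffers);
output_buffers[0].Read<float>(absl::MakeSpan(output_data, output_size));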

For a full, detailed guide, including a Colab and sample apps, visit our LiteRT NPU documentation.

To provide flexible integration options that fit your specific deployment needs, LiteRT offers both ahead-of-time (AOT) and on-device (JIT) compilation. This allows you to choose the best strategy based on your application’s unique requirements:

  • AOT compilation: Optimal for complex models with known target SoCs. It minimizes initialization and memory footprint at launch for an "instant-start" experience.
  • On-device compilation: Best for distributing small models across various platforms. It requires no preparation, though first-run initialization costs are higher.

We are collaborating closely with silicon leaders across the industry to bring high-performance NPU acceleration to developers. Our first production-ready integrations with MediaTek and Qualcomm are available now. Read our technical deep dives to see how we achieved best-in-class NPU performance, reaching speeds up to 100x faster than CPU and 10x faster than GPU.

A real-time, on-device Chinese assistant with vision and audio multimodality, powered by Gemma 3n 2B, running on a vivo X300 Pro with the MediaTek Dimensity 9500 NPU (left). Scene understanding using FastVLM's vision modality, running on a Xiaomi 17 Pro Max with the Snapdragon 8 Elite Gen 5 (right).

Building on this momentum, we are actively expanding LiteRT’s NPU support to additional hardware. Stay tuned for further announcements!

Superior cross-platform GenAI support

Open models offer unparalleled flexibility and customization, yet deploying them remains a high-friction process. Navigating the complexities of model lowering, inference, and benchmarking often demands significant engineering overhead. To bridge this gap and enable developers to build custom experiences efficiently, we provide the following integrated tech stack:

[Figure: LLM critical user journey]

  • LiteRT Torch Generative API: A Python module for authoring and converting transformer-based PyTorch models into the LiteRT-LM/LiteRT formats. It provides optimized building blocks that ensure high-performance execution on edge devices.
  • LiteRT-LM: A specialized orchestration layer built on top of LiteRT that manages LLM-specific complexities. It is the battle-tested infrastructure powering Gemini Nano deployment across Google products, including Chrome and Pixel Watch (see the usage sketch after this list).
  • LiteRT Converter & Runtime: The foundational engine that provides efficient model conversion, runtime execution, and optimization, empowering advanced hardware acceleration across CPU, GPU, and NPU, delivering state-of-the-art performance across edge platforms.
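
To give a feel for LiteRT-LM, the sketch below strings an engine and a session together for on-device text generation. It is a rough sketch only: the ModelAssets, EngineSettings, Engine, SessionConfig, GenerateContent, and InputText names, the Backend::GPU selector, and the gemma3-1b-it.litertlm filename are assumptions drawn from the public LiteRT-LM repository and may not match the current API exactly.

// Illustrative sketch only; names below are assumptions, not verified API.
// 1. Point the engine at a LiteRT-LM model bundle.
auto model_assets = ModelAssets::Create("gemma3-1b-it.litertlm");

// 2. Create the engine with the desired backend (CPU, GPU, or NPU).
auto engine_settings =
    EngineSettings::CreateDefault(*model_assets, Backend::GPU);
auto engine = Engine::CreateEngine(*engine_settings);

// 3. Open a session and generate text; the session manages LLM-specific
// state such as the KV cache and sampling for you.
auto session = (*engine)->CreateSession(SessionConfig::CreateDefault());
auto responses = (*session)->GenerateContent(
    {InputText("Summarize LiteRT in one sentence.")});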

Together, these components offer a production-grade path for running popular open models with leading performance. To demonstrate this, we benchmarked Gemma 3 1B on the Samsung Galaxy S25 Ultra, comparing LiteRT with llama.cpp.

[Figure: Gemma 3 1B performance benchmark]

LiteRT demonstrates a clear performance advantage, outperforming llama.cpp by 3x on CPU, by 7x on GPU for decode (memory-bound), and by 19x on GPU for prefill (compute-bound). Furthermore, LiteRT’s NPU acceleration delivers an additional 2x performance gain over the GPU for prefill, maximizing the potential of the compute hardware. For a detailed look at the engineering behind these benchmarks, read our deep dive into LiteRT’s optimizations under the hood.

LiteRT supports an extensive and growing collection of popular open-weight models, meticulously optimized and pre-converted for immediate deployment, including:

  • Gemma family: Gemma 3 (270M, 1B), Gemma 3n, EmbeddingGemma, and FunctionGemma.
  • Qwen, Phi, FastVLM and more.

AI Edge Gallery app demos powered by LiteRT: TinyGarden (left) and Mobile Actions (right), built with FunctionGemma.

These models are available on the LiteRT Hugging Face Community and can be explored interactively via the Google AI Edge Gallery app on Android/Play and iOS.

For more development details, visit our LiteRT GenAI documentation.

Broad ML framework support

[Figure: ML framework support]

Deployment shouldn't be dictated by your choice of training framework. LiteRT offers seamless model conversion from the industry's most popular ML frameworks: PyTorch, TensorFlow, and JAX.

  • PyTorch support: With the LiteRT Torch library, you can convert your PyTorch models directly to the .tflite format in a single, streamlined step. This ensures that PyTorch-based architectures are immediately ready to take full advantage of LiteRT's advanced hardware acceleration, eliminating the need for complex intermediate translations.
  • TensorFlow and JAX: LiteRT continues to provide robust, best-in-class support for the TensorFlow ecosystem and a reliable conversion path for JAX models via the jax2tf bridge. This ensures that state-of-the-art research from any of Google’s core ML libraries can be deployed efficiently to billions of devices.

By consolidating these paths, LiteRT enables high research-to-production velocity regardless of your development environment. You can author models in your preferred framework and rely on LiteRT to deliver performance across CPU, GPU, and NPU backends.

To get started, explore the LiteRT Torch Colab and try the conversion process yourself, or dive into the technical details of our PyTorch integration in this tech deep dive.

Reliability and compatibility you can trust

While the capabilities of LiteRT have significantly expanded, our commitment to long-term reliability and cross-platform consistency remains unchanged. LiteRT continues to build on the proven .tflite model format, the industry-standard, single-file format that ensures your existing models remain portable and compatible across Android, iOS, macOS, Linux, Windows, the web, and IoT devices.

To provide developers with a continuous experience, LiteRT offers robust support for both existing and next-generation execution paths:

  • The Interpreter API: Your existing production models will continue to run reliably, maintaining the broad reach and rock-solid stability you depend on.
  • The new CompiledModel API: Designed for the next generation of AI, this modern interface provides a seamless path to unlock the full potential of GPU and NPU acceleration. See more reasons to choose the CompiledModel API in the documentation; a brief sketch of the classic Interpreter path follows below for comparison.
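
To make the contrast concrete, here is a minimal sketch of the classic Interpreter path using the long-standing TFLite C++ API that LiteRT continues to support; compare it with the CompiledModel examples earlier in this post. The header paths and resolver class reflect the traditional TensorFlow Lite layout and may be packaged slightly differently in your LiteRT distribution, and mymodel.tflite is a placeholder.

#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model_builder.h"

// Classic Interpreter path: load the .tflite model and build an interpreter.
auto model = tflite::FlatBufferModel::BuildFromFile("mymodel.tflite");
tflite::ops::builtin::BuiltinOpResolver resolver;
std::unique_ptr<tflite::Interpreter> interpreter;
tflite::InterpreterBuilder(*model, resolver)(&interpreter);
interpreter->AllocateTensors();

// Fill the first input tensor, run inference, and read the first output.
float* input = interpreter->typed_input_tensor<float>(0);
// ... copy your input data into `input` ...
interpreter->Invoke();
float* output = interpreter->typed_output_tensor<float>(0);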

What’s next

Ready to build the future of on-device AI? Get started today with the resources linked throughout this post: the LiteRT DevSite, the LiteRT NPU and GenAI documentation, and the LiteRT Torch Colab.

Let us know your feedback and feature requests by opening an issue on our GitHub repository. We can’t wait to see what you build with LiteRT!

Acknowledgements

Thank you to the members of the team, and collaborators for their contributions in making the advancements in this release possible: Advait Jain, Andrew Zhang, Andrei Kulik, Akshat Sharma, Arian Arfaian, Byungchul Kim, Changming Sun, Chunlei Niu, Chun-nien Chan, Cormac Brick, David Massoud, Dillon Sharlet, Fengwu Yao, Gerardo Carranza, Jingjiang Li, Jing Jin, Grant Jensen, Jae Yoo, Juhyun Lee, Jun Jiang, Kris Tonthat, Lin Chen, Lu Wang, Luke Boyer, Marissa Ikonomidis, Matt Kreileder, Matthias Grundmann, Majid Dadashi, Marko Ristić, Matthew Soulanille, Na Li, Ping Yu, Quentin Khan, Raman Sarokin, Ram Iyengar, Rishika Sinha, Sachin Kotwani, Shuangfeng Li, Steven Toribio, Suleman Shahid, Teng-Hui Zhu, Terry (Woncheol) Heo, Vitalii Dziuba, Volodymyr Kysenko, Weiyi Wang, Yu-Hui Chen, Pradeep Kuppala and gTech team.