MediaTek NPU and LiteRT: Powering the next generation of on-device AI

DEC. 8, 2025
Lu Wang Senior Staff Software Engineer
Arian Arfaian Staff Software Engineer
Luke Boyer Software Engineer

The Neural Processing Unit (NPU) has become the critical enabler for the next generation of on-device AI. By delivering tens of TOPS (Tera Operations Per Second) of peak performance at minimal power consumption, NPUs let devices run sophisticated, computationally heavy generative AI models that were previously out of reach for standard edge devices.

Smart grouping, powered by an on-device large language model running on the MediaTek Kompanio Ultra NPU in the Chromebook Plus 14

These powerful NPUs are the engine behind a massive, diverse ecosystem of products, from flagship smartphones, laptops, and tablets to smart home hubs and IoT devices. However, deploying AI on NPUs has often been difficult, hindering broad adoption. The NPU space is highly diverse, with hundreds of SoC variants targeting different device types, forcing developers to juggle multiple compilers and distribute multiple runtimes. Existing on-device ML infrastructure is typically tailored for CPUs and GPUs, lacking deep integration with specialized NPU SDKs and their unique compilation needs. This has resulted in complex, ad-hoc deployment workflows. Moreover, running sophisticated GenAI models efficiently on NPUs requires advanced optimizations and specialized kernels, going far beyond simple operator delegation.

Together with MediaTek, we are excited to announce the new LiteRT NeuroPilot Accelerator to help developers overcome these challenges. It is a ground-up successor to the TFLite NeuroPilot delegate, bringing a seamless deployment experience, state-of-the-art LLM support, and advanced performance to millions of devices worldwide.

Key features of the LiteRT NeuroPilot Accelerator

Moving well beyond basic acceleration, the LiteRT NeuroPilot Accelerator provides a unified development workflow and sophisticated features designed to productionize AI on MediaTek NPUs. Here are the highlights:

  • Seamless and unified deployment workflow: The accelerator provides easy access to various MediaTek NPUs via a unified API, abstracting away SDK complexities. You can choose between two distinct compilation workflows: offline (Ahead-of-Time, a.k.a. AOT) and online (on-device), giving you the flexibility to choose the best strategy for your application, whether it's minimizing first-run latency or enabling platform-agnostic model distribution.
  • Rich generative AI capabilities: Our collaboration with MediaTek unlocks the full potential of state-of-the-art models like the Gemma family. This enables building and deploying sophisticated generative AI features, from advanced text generation to new multimodal applications, directly on NPU.
  • Efficient, cross-platform development: We’ve introduced a new, simplified C++ API (an improvement on the previous C API) that makes building highly efficient ML pipelines easier. This new API works seamlessly with Native Hardware Buffer Interoperability, allowing for zero-copy data passing from AHardwareBuffer directly to the NPU, as well as automatic conversion from OpenGL/OpenCL buffers. This is critical for building high-throughput, real-time camera and video applications.

Seamless and unified deployment workflow

Traditionally, developers needed to build for various combinations of SoC providers and SoC versions and had to manage the distribution of compiled models and runtimes for each combination. To solve this, we have created a simple, 3-step workflow to get your models running with NPU acceleration.

The full, detailed guide, with a Colab and sample app, is available in our LiteRT NPU documentation. Here is the high-level process:

  • Step 1: AOT compilation for the target SoCs (optional). Use the LiteRT Python library to compile your .tflite model for the supported SoCs; see the LiteRT AOT Compilation Tutorial for details. While optional, AOT compilation is highly recommended for larger models to reduce on-device initialization time. This step is not required if you use on-device compilation.
  • Step 2: Deploy with Google Play for On-device AI (PODAI) if on Android. Use LiteRT to export the model assets and required runtime libraries into an "AI Pack", the format used by PODAI, and copy the AI Pack into your Android app project. When users install your app from Google Play, Play analyzes the device and automatically delivers the matching model and runtime to compatible devices.
  • Step 3: Inference using the LiteRT runtime. LiteRT abstracts away the complexity of hardware fragmentation. For both AOT and on-device compilation, you simply load the model and specify Accelerator.NPU in the options, as sketched below. LiteRT handles the rest and even includes a robust fallback mechanism: you can specify GPU or CPU as secondary options, and LiteRT will automatically use them if the NPU is unavailable.
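
For illustration, here is a minimal C++ sketch of Step 3, using the same CompiledModel API shown in the snippets later in this post. The empty environment options and the way fallback accelerators are combined by OR-ing the accelerator flags are assumptions that may differ in your LiteRT version, so treat this as an outline rather than a drop-in snippet.

// Step 3 sketch: request NPU acceleration with CPU fallback.
// The empty environment options and the OR-combined accelerator flags are
// illustrative assumptions; check the LiteRT headers for exact signatures.
const std::vector<Environment::Option> environment_options = {};
LITERT_ASSIGN_OR_RETURN(auto env,
    Environment::Create(absl::MakeConstSpan(environment_options)));
LITERT_ASSIGN_OR_RETURN(auto model, Model::CreateFromFile("model.tflite"));

auto options = Options::Create();
// Prefer the NPU; fall back to the CPU if the NPU is unavailable.
options->SetHardwareAccelerators(kLiteRtHwAcceleratorNpu |
                                 kLiteRtHwAcceleratorCpu);

LITERT_ASSIGN_OR_RETURN(auto compiled_model,
    CompiledModel::Create(env, model, *options));

// Allocate I/O buffers and run inference as usual.
LITERT_ASSIGN_OR_RETURN(auto input_buffers, compiled_model.CreateInputBuffers());
LITERT_ASSIGN_OR_RETURN(auto output_buffers, compiled_model.CreateOutputBuffers());
LITERT_RETURN_IF_ERROR(compiled_model.Run(input_buffers, output_buffers));
C++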

AOT and on-device compilation

With the new LiteRT NeuroPilot Accelerator, we’ve moved from a high-level wrapper to a direct, native integration with the NeuroPilot compiler and runtime. This enables a powerful Ahead-of-Time (AOT) compilation workflow that was previously out of reach, giving developers flexibility in their deployment strategy:

  • Offline (AOT) compilation: This is best suited for large, complex models where the target SoC is known. Compiling ahead-of-time significantly reduces initialization costs and lowers memory usage when the user launches your app.
  • Online (on-device) compilation: This is ideal for platform-agnostic model distribution of small models. The model is compiled on the user's device during initialization, requiring no extra preparation step but incurring a higher first-run cost.

Here’s how the two approaches compare for a large model (e.g., Gemma 3 270M). As shown, on-device compilation for such a large model can take over a minute, making AOT the more practical choice for production.

[Chart: initialization time for Gemma 3 270M, AOT vs. on-device (JIT) compilation]

Rich generative AI capabilities with Gemma and other open-weight models

On supported Android devices, you can use Gemini Nano through ML Kit. For markets where Gemini Nano is not available, or for use cases that require deeper customization, we now unlock the full potential of open-weight models. This includes the Gemma model family, a set of lightweight, state-of-the-art open models from Google optimized specifically for on-device use cases.

As announced at MediaTek's recent Dimensity 9500 event, our collaboration brings optimized, production-ready support for the following models on their latest chipsets:

  • Qwen3 0.6B: A foundation model powering new AI experiences from OEMs such as Xiaomi, Huawei, and Vivo in Mainland China.
  • Gemma 3 270M: A hyper-efficient, compact base model designed for task-specific fine-tuning, enabling high-speed, low-latency features like sentiment analysis or entity extraction in resource-constrained environments.
  • Gemma 3 1B: A lightweight and multilingual text-only model that balances compact size with strong generative capabilities, making it ideal for a wide range of on-device reasoning, summarization, and content creation tasks.
  • Gemma 3n E2B: A mobile-first, powerful multimodal model that natively understands audio, vision, and text, purpose-built for low-latency applications like real-time speech translation and visual understanding.
  • EmbeddingGemma 300M: A state-of-the-art text embedding model that produces high-quality embeddings on-device, great for Retrieval Augmented Generation (RAG), semantic search, and classification.

Powered by special optimizations targeting the MediaTek NPU, Gemma models are accelerated by up to 12x compared to the CPU and 10x compared to the GPU. This delivers impressively fast inference, as shown in the performance benchmarks for Gemma and Qwen on the latest MediaTek Dimensity 9500 in the Vivo X300 Pro:

[Chart: Gemma and Qwen prefill and decode benchmarks on the MediaTek Dimensity 9500 (Vivo X300 Pro)]

As the results show, the Gemma 3n E2B model achieves over 1600 tokens/sec for prefill and 28 tokens/sec for decode (with 4K context) on the NPU; at that rate, a full 4K-token prompt prefills in roughly 2.5 seconds. This speed enables sophisticated multimodal use cases.

A real-time, on-device Chinese assistant with vision and audio multimodality, powered by Gemma 3n E2B and running on the Vivo X300 Pro with the MediaTek Dimensity 9500 NPU. (Left) Recognizing a dish and providing cooking instructions. (Middle) Identifying a plant and suggesting care tips. (Right) Generating a one-day itinerary for San Francisco.

How to deploy Gemma

To get started, you can find pre-compiled Gemma models for the MediaTek NPU in the LiteRT HuggingFace Community. We provide two primary integration paths, with APIs for both C/C++ and Kotlin/Java users.

1. For text generation (e.g., Gemma 3 270M), use LiteRT-LM: built on top of LiteRT, LiteRT-LM provides a high-level, stateful “text-in, text-out” API that simplifies inference with generative text models.

// 1. Define model assets and engine settings.
auto model_assets = ModelAssets::Create(model_path);
auto engine_settings = EngineSettings::CreateDefault(
    model_assets, litert::lm::Backend::NPU); // Specify inference on NPU.

// 2. Create the main Engine object. This loads the model.
absl::StatusOr<std::unique_ptr<Engine>> engine = Engine::CreateEngine(engine_settings);

// 3. Create a Session for a new conversation.
auto session_config = SessionConfig::CreateDefault();
absl::StatusOr<std::unique_ptr<Engine::Session>> session = (*engine)->CreateSession(session_config);

// 4. Generate content using a high-level API.
absl::StatusOr<Responses> responses = (*session)->GenerateContent(
    {InputText("What is the tallest building in the world?")});

// 5. Print the response.
std::cout << *responses << std::endl;
C++

See the LiteRT-LM documentation for more details on setting up MediaTek NeuroPilot and on C++ and Kotlin API usage.

2. For EmbeddingGemma, use LiteRT: EmbeddingGemma fits perfectly with LiteRT’s “tensor-in, tensor-out” API.

// 1. Set up inference options
auto env = Environment::Create({dispatch_options});
auto embedder_model_def = Model::CreateFromFile(embedder_path);
auto options = Options::Create();
options->SetHardwareAccelerators(kLiteRtHwAcceleratorNpu);

// 2. Create LiteRT CompiledModel
LITERT_ASSIGN_OR_RETURN(auto embedder_model,
    CompiledModel::Create(*env, *embedder_model_def, *options));
LITERT_ASSIGN_OR_RETURN(auto input_buffers, embedder_model->CreateInputBuffers());
LITERT_ASSIGN_OR_RETURN(auto output_buffers, embedder_model->CreateOutputBuffers());

// 3. Inference with inputs
LITERT_RETURN_IF_ERROR(input_buffers[0].Write<int>(token_ids));
LITERT_RETURN_IF_ERROR(
    embedder_model->Run(input_buffers, output_buffers));
LITERT_RETURN_IF_ERROR(output_buffers[0].Read(output_embeddings));
C++

See the full instructions for C++ and Kotlin development in the LiteRT documentation. An end-to-end example is available in the LiteRT Semantic Similarity demo app.
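
Once you have the output embeddings, comparing them for semantic search or RAG typically comes down to cosine similarity between vectors. The helper below is a self-contained, illustrative sketch; the function name and the use of plain float vectors are our own choices, not part of the LiteRT API.

#include <cmath>
#include <vector>

// Illustrative helper: cosine similarity between two embedding vectors,
// e.g. the output_embeddings produced by the EmbeddingGemma snippet above.
float CosineSimilarity(const std::vector<float>& a, const std::vector<float>& b) {
  float dot = 0.f, norm_a = 0.f, norm_b = 0.f;
  for (size_t i = 0; i < a.size() && i < b.size(); ++i) {
    dot += a[i] * b[i];
    norm_a += a[i] * a[i];
    norm_b += b[i] * b[i];
  }
  if (norm_a == 0.f || norm_b == 0.f) return 0.f;
  return dot / (std::sqrt(norm_a) * std::sqrt(norm_b));
}
C++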

Support for converting custom Gemma models for the MediaTek NPU via LiteRT is coming soon, and more NPU demos will be available in the AI Edge Gallery.

Efficient, cross-platform development

To make building rich, real-time applications easier across a variety of platforms and devices, we’ve focused on improving the developer experience and data pipeline efficiency. This starts with a new, simplified C++ API, an improvement on the previous C API that makes it easier to build efficient, cross-platform ML applications.

Our new API was designed to work seamlessly with native hardware buffers. The accelerator now supports Native Hardware Buffer Interoperability, which enables two key efficiencies. First, it allows for zero-copy data passing with AHardwareBuffer. Second, it provides zero-copy interop between OpenGL/OpenCL buffers, common inputs/outputs of GPU image processing, and AHardwareBuffer. Instead of converting input/output data to and from the CPU, you can pass camera frames or video directly from other ML pipeline components to NPU via LiteRT. This is critical for building the high-throughput, real-time camera and video applications that are a key goal of this release.

Here is an example of GPU pre-processing followed by NPU inference with buffer interop support in LiteRT:

// Define a LiteRT environment to use existing EGL display and context.
const std::vector<Environment::Option> environment_options = {
   {OptionTag::EglDisplay, user_egl_display},
   {OptionTag::EglContext, user_egl_context}};
LITERT_ASSIGN_OR_RETURN(auto env, Environment::Create(absl::MakeConstSpan(environment_options)));

// Load Model and initialize NPU runtime. 
LITERT_ASSIGN_OR_RETURN(auto model, Model::CreateFromFile("model.tflite"));
LITERT_ASSIGN_OR_RETURN(auto compiled_model, CompiledModel::Create(env, model, HwAccelerator::kNpu));

// Prepare I/O buffers.
LITERT_ASSIGN_OR_RETURN(RankedTensorType tensor_type, model.GetInputTensorType("input_name0"));
// Create an input TensorBuffer directly from an OpenGL SSBO (GL Buffer).  
LITERT_ASSIGN_OR_RETURN(auto tensor_buffer_from_opengl, TensorBuffer::CreateFromGlBuffer(env, tensor_type, GL_SHADER_STORAGE_BUFFER, gl_buffer_id, size_bytes, offset));
std::vector<TensorBuffer> input_buffers;
input_buffers.push_back(std::move(tensor_buffer_from_opengl));

// Create an output TensorBuffer of the model. 
LITERT_ASSIGN_OR_RETURN(auto output_buffers, compiled_model.CreateOutputBuffers());

// Run inference. 
compiled_model.Run(input_buffers, output_buffers);
C++

See more instructions in the LiteRT C++ API documentation, and the LiteRT Async Segmentation C++ demo app.
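
The example above covers OpenGL buffers. For camera frames that arrive as an AHardwareBuffer, the pattern is analogous; the sketch below assumes a CreateFromAhwb-style factory on TensorBuffer mirroring the CreateFromGlBuffer call above, and GetCameraFrameAhwb is a hypothetical helper from your camera pipeline. Check the TensorBuffer header in your LiteRT release for the exact factory name and parameters.

// Illustrative sketch (continuing from the example above): wrap an existing
// AHardwareBuffer, e.g. a camera frame, as a LiteRT input tensor without
// copying through the CPU. CreateFromAhwb is assumed to mirror
// CreateFromGlBuffer; GetCameraFrameAhwb is a hypothetical helper.
AHardwareBuffer* camera_frame = GetCameraFrameAhwb();
LITERT_ASSIGN_OR_RETURN(RankedTensorType input_type, model.GetInputTensorType("input_name0"));
LITERT_ASSIGN_OR_RETURN(auto tensor_buffer_from_ahwb,
    TensorBuffer::CreateFromAhwb(env, input_type, camera_frame, /*ahwb_offset=*/0));

std::vector<TensorBuffer> input_buffers;
input_buffers.push_back(std::move(tensor_buffer_from_ahwb));

LITERT_ASSIGN_OR_RETURN(auto output_buffers, compiled_model.CreateOutputBuffers());
compiled_model.Run(input_buffers, output_buffers);
C++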

Looking ahead

LiteRT now makes it easy to bring NPU-accelerated ML to millions of MediaTek devices through the LiteRT NeuroPilot Accelerator, dramatically improving the user experience for a massive global audience.

LiteRT NPU support is now available to all developers. We encourage you to try it out today! Check out our example Colab, explore the Sample App, and dive into the official LiteRT Devsite for documentation and guides.

Acknowledgements

Special thanks to the Google ODML team and the MediaTek team for their significant contributions to this effort:

Google ODML team: Alice Zheng, Advait Jain, Andrew Zhang, Arian Arfaian, Chintan Parikh, Chunlei Niu, Cormac Brick, Gerardo Carranza, Gregory Karpiak, Jingjiang Li, Jing Jin, Julius Kammerl, Lu Wang, Luke Boyer, Marissa Ikonomidis, Maria Lyubimtseva, Matt Kreileder, Matthias Grundmann, Na Li, Ping Yu, Quentin Khan, Rishika Sinha, Sachin Kotwani, Sebastian Schmidt, Steven Toribio, Teng-Hui Zhu, Terry (Woncheol) Heo, Vitalii Dziuba, Weiyi Wang, Yu-Hui Chen, Zichuan Wei.

MediaTek team: Bo-Yan Lin, Chao-Yuan Lee, Cheng-Yen Lin, Chia-Lin Yu, Chiayu Sung, Christoph Kuo, Chuo-Ling Chang, Deep Yap, Hsienkai Kuo, HungChun Liu, Jush Lu, Kayden Yang, Lei Chen, Peng-Wen Chen, Poyuan Jeng, Tzu-hsuan Wei, Waimun Wong, Wen-Li Shih, YanRen Chang, Yi-Min Tsai, Yu-Chieh Lin, Yu-Ting Wan.