The Neural Processing Unit (NPU) has become the critical enabler for the next generation of on-device AI. By delivering peak performance of tens of TOPS (Tera Operations Per Second) at minimal power consumption, NPUs allow devices to run sophisticated, computationally heavy generative AI models that were previously out of reach for standard edge devices.
These powerful NPUs are the engine behind a massive, diverse ecosystem of products, from flagship smartphones, laptops, and tablets to smart home hubs and IoT devices. However, deploying AI on NPUs has often been difficult, hindering broad adoption. The NPU space is highly diverse, with hundreds of SoC variants targeting different device types, creating significant hurdles for developers who must manage compilers and distribute runtimes. Existing on-device ML infrastructure is typically tailored for CPUs and GPUs, lacking deep integration with specialized NPU SDKs and their unique compilation needs. This has resulted in complex, ad-hoc deployment workflows. Moreover, getting sophisticated GenAI models to run efficiently on NPUs requires advanced optimizations and specialized kernels, going far beyond simple operator delegation.
Together with MediaTek, we are excited to announce the new LiteRT NeuroPilot Accelerator to help developers overcome these challenges. It is a ground-up successor to the TFLite NeuroPilot delegate, bringing a seamless deployment experience, state-of-the-art LLM support, and advanced performance to millions of devices worldwide.
Moving well beyond basic acceleration, the LiteRT NeuroPilot Accelerator provides a unified development workflow and sophisticated features designed to productionize AI on MediaTek NPUs. Here are the highlights:
Traditionally, developers needed to build for various combinations of SoC providers and SoC versions and had to manage the distribution of compiled models and runtimes for each combination. To solve this, we have created a simple, 3-step workflow to get your models running with NPU acceleration.
The full, detailed guide, with a Colab and sample app, is available in our LiteRT NPU documentation. Here is the high-level process:
With the new LiteRT NeuroPilot Accelerator, we’ve moved from a high-level wrapper to a direct, native integration with the NeuroPilot compiler and runtime. This enables a powerful Ahead-of-Time (AOT) compilation workflow that was previously out of reach, giving developers flexibility in their deployment strategy:
Here’s how the two approaches compare for a large model (e.g., Gemma 3 270M). As shown, on-device compilation for such a large model can take over a minute, making AOT the more practical choice for production.
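From the application's point of view, both strategies land on the same CompiledModel API shown later in this post; what differs is whether the NPU bytecode was compiled offline for the target SoC (AOT) or is compiled on the device at first load. Here is a hedged sketch of the runtime side, reusing the C++ calls from the examples below; the model file names are hypothetical:
// Hedged sketch reusing the LiteRT C++ calls shown later in this post; the
// model file names are hypothetical.
const std::vector<Environment::Option> no_options = {};
LITERT_ASSIGN_OR_RETURN(auto env,
                        Environment::Create(absl::MakeConstSpan(no_options)));

// Option A: AOT-compiled model. The .tflite already embeds NPU bytecode for
// the target SoC, so creating the CompiledModel is mostly a fast load.
LITERT_ASSIGN_OR_RETURN(auto aot_model,
                        Model::CreateFromFile("gemma3_270m_npu_aot.tflite"));
LITERT_ASSIGN_OR_RETURN(auto compiled_aot,
                        CompiledModel::Create(env, aot_model, HwAccelerator::kNpu));

// Option B: on-device compilation. The same call triggers NeuroPilot
// compilation at first load, which for a model of this size can take over a
// minute, making AOT the practical choice for production.
LITERT_ASSIGN_OR_RETURN(auto generic_model,
                        Model::CreateFromFile("gemma3_270m.tflite"));
LITERT_ASSIGN_OR_RETURN(auto compiled_on_device,
                        CompiledModel::Create(env, generic_model, HwAccelerator::kNpu));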
On supported Android devices you can use Gemini Nano through ML Kit. For markets where Gemini Nano is not supported, or if you have use cases that require deeper customization, we now unlock the full potential of open-weight models. This includes the Gemma model family, a set of lightweight, state-of-the-art open models from Google that are optimized specifically for on-device use cases.
As announced at MediaTek's recent Dimensity 9500 event, our collaboration brings optimized, production-ready support for the following models on their latest chipsets:
Powered by special optimizations targeting the MediaTek NPU, Gemma models are accelerated by up to 12x compared to CPU, and 10x compared to GPU. This delivers impressively fast inference, as shown in the performance benchmarks for Gemma and Qwen on the latest MediaTek Dimensity 9500 with Vivo X300 Pro:
As the results show, the Gemma 3n E2B model achieves over 1600 tokens/sec for prefill and 28 tokens/sec for decode (with 4K context) on the NPU. This speed enables sophisticated multimodal use cases.
To get started, you can find pre-compiled Gemma models for the MediaTek NPU on the LiteRT HuggingFace Community. We provide two primary integration paths, each with support for both C/C++ and Kotlin/Java.
1. For text generation (e.g., Gemma 3 270M), use LiteRT-LM: built on top of LiteRT, LiteRT-LM provides a high-level, stateful “text-in, text-out” API that simplifies inference with generative text models.
// 1. Define model assets and engine settings.
auto model_assets = ModelAssets::Create(model_path);
auto engine_settings = EngineSettings::CreateDefault(
    model_assets, litert::lm::Backend::NPU);  // Specify inference on NPU.

// 2. Create the main Engine object. This loads the model.
absl::StatusOr<std::unique_ptr<Engine>> engine =
    Engine::CreateEngine(engine_settings);

// 3. Create a Session for a new conversation.
auto session_config = SessionConfig::CreateDefault();
absl::StatusOr<std::unique_ptr<Engine::Session>> session =
    (*engine)->CreateSession(session_config);

// 4. Generate content using a high-level API.
absl::StatusOr<Responses> responses = (*session)->GenerateContent(
    {InputText("What is the tallest building in the world?")});

// 5. Print the response.
std::cout << *responses << std::endl;
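Because the Session is stateful, a follow-up turn can simply reuse it. A minimal sketch using only the calls shown above:
// A follow-up turn in the same Session reuses the conversation state, so the
// model can resolve "it" against the previous question.
absl::StatusOr<Responses> follow_up = (*session)->GenerateContent(
    {InputText("How tall is it?")});
std::cout << *follow_up << std::endl;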
See the LiteRT-LM documentation for more details on setting up MediaTek NeuroPilot and on C++ and Kotlin API usage.
2. For EmbeddingGemma, use LiteRT: EmbeddingGemma fits perfectly with LiteRT’s “tensor-in, tensor-out” API.
// 1. Set up inference options.
// `dispatch_options` (defined elsewhere) holds the environment options,
// e.g., the location of the NPU dispatch library.
auto env = Environment::Create({dispatch_options});
auto embedder_model_def = Model::CreateFromFile(embedder_path);
auto options = Options::Create();
options->SetHardwareAccelerators(kLiteRtHwAcceleratorNpu);

// 2. Create the LiteRT CompiledModel.
LITERT_ASSIGN_OR_RETURN(auto embedder_model,
                        CompiledModel::Create(*env, *embedder_model_def, *options));
LITERT_ASSIGN_OR_RETURN(auto input_buffers, embedder_model.CreateInputBuffers());
LITERT_ASSIGN_OR_RETURN(auto output_buffers, embedder_model.CreateOutputBuffers());

// 3. Run inference with the inputs.
LITERT_RETURN_IF_ERROR(input_buffers[0].Write<int>(token_ids));
LITERT_RETURN_IF_ERROR(embedder_model.Run(input_buffers, output_buffers));
LITERT_RETURN_IF_ERROR(output_buffers[0].Read(output_embeddings));
See the full instructions for C++ and Kotlin development in the LiteRT documentation. An end-to-end example is available in the LiteRT Semantic Similarity demo app.
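Once you have embedding vectors back from the model, scoring semantic similarity (as in the demo above) typically comes down to cosine similarity. Here is a minimal, LiteRT-independent sketch; the function below is illustrative and not part of any LiteRT API:
#include <cmath>
#include <cstddef>
#include <vector>

// Cosine similarity between two embedding vectors, e.g., two
// `output_embeddings` results read back from the embedder above.
float CosineSimilarity(const std::vector<float>& a, const std::vector<float>& b) {
  float dot = 0.f, norm_a = 0.f, norm_b = 0.f;
  for (size_t i = 0; i < a.size() && i < b.size(); ++i) {
    dot += a[i] * b[i];
    norm_a += a[i] * a[i];
    norm_b += b[i] * b[i];
  }
  // Small epsilon avoids division by zero for degenerate inputs.
  return dot / (std::sqrt(norm_a) * std::sqrt(norm_b) + 1e-9f);
}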
Support for converting custom Gemma models for the MediaTek NPU via LiteRT is coming soon, and more NPU demos will be available in the AI Edge Gallery.
To make it easier to build rich, real-time applications across a variety of platforms and devices, we’ve focused on improving the developer experience and data pipeline efficiency. This starts with a new, simplified C++ API, which improves on the previous C API and makes it easier to build efficient, cross-platform ML applications.
Our new API was designed to work seamlessly with native hardware buffers. The accelerator now supports Native Hardware Buffer Interoperability, which enables two key efficiencies. First, it allows zero-copy data passing with AHardwareBuffer. Second, it provides zero-copy interop between AHardwareBuffer and OpenGL/OpenCL buffers, which are common inputs and outputs of GPU image processing. Instead of converting input/output data to and from the CPU, you can pass camera frames or video directly from other ML pipeline components to the NPU via LiteRT (a sketch of the AHardwareBuffer path follows the GL example below). This is critical for building the high-throughput, real-time camera and video applications that are a key goal of this release.
Here is an example of GPU pre-processing followed by NPU inference with buffer interop support in LiteRT:
// Define a LiteRT environment to use the existing EGL display and context.
const std::vector<Environment::Option> environment_options = {
    {OptionTag::EglDisplay, user_egl_display},
    {OptionTag::EglContext, user_egl_context}};
LITERT_ASSIGN_OR_RETURN(auto env,
                        Environment::Create(absl::MakeConstSpan(environment_options)));

// Load the model and initialize the NPU runtime.
LITERT_ASSIGN_OR_RETURN(auto model, Model::CreateFromFile("model.tflite"));
LITERT_ASSIGN_OR_RETURN(auto compiled_model,
                        CompiledModel::Create(env, model, HwAccelerator::kNpu));

// Prepare I/O buffers.
LITERT_ASSIGN_OR_RETURN(RankedTensorType tensor_type,
                        model.GetInputTensorType("input_name0"));

// Create an input TensorBuffer directly from an OpenGL SSBO (GL buffer).
LITERT_ASSIGN_OR_RETURN(auto tensor_buffer_from_opengl,
                        TensorBuffer::CreateFromGlBuffer(env, tensor_type,
                                                         GL_SHADER_STORAGE_BUFFER,
                                                         gl_buffer_id, size_bytes, offset));
std::vector<TensorBuffer> input_buffers;
input_buffers.push_back(std::move(tensor_buffer_from_opengl));

// Create the output TensorBuffers of the model.
LITERT_ASSIGN_OR_RETURN(auto output_buffers, compiled_model.CreateOutputBuffers());

// Run inference.
compiled_model.Run(input_buffers, output_buffers);
See more instructions in the LiteRT C++ API documentation, and the LiteRT Async Segmentation C++ demo app.
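For pipelines that already produce an AHardwareBuffer (for example, camera frames), the zero-copy path is analogous. The sketch below reuses the env and model from the GL example above and assumes a CreateFromAhwb-style factory on TensorBuffer, analogous to CreateFromGlBuffer; check the LiteRT TensorBuffer header for the exact name and signature:
// Hedged sketch: wrap an existing AHardwareBuffer (declared in
// <android/hardware_buffer.h>), e.g., a camera frame, as a LiteRT TensorBuffer
// without copying. `TensorBuffer::CreateFromAhwb` is assumed here by analogy
// with CreateFromGlBuffer above; verify the exact factory and its arguments
// against the LiteRT headers.
AHardwareBuffer* camera_frame = nullptr;  // Obtained from the camera pipeline in a real app.
LITERT_ASSIGN_OR_RETURN(RankedTensorType input_type,
                        model.GetInputTensorType("input_name0"));
LITERT_ASSIGN_OR_RETURN(auto input_from_ahwb,
                        TensorBuffer::CreateFromAhwb(env, input_type, camera_frame,
                                                     /*ahwb_offset=*/0));
// `input_from_ahwb` can then be passed to compiled_model.Run() exactly like
// the GL-backed buffer above.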
LiteRT now makes it easy to bring NPU-accelerated ML to millions of MediaTek devices through LiteRT NeuroPilot Accelerator, dramatically improving the user experience for a massive global audience.
LiteRT NPU support is now available to all developers. We encourage you to try it out today! Check out our example Colab, explore the Sample App, and dive into the official LiteRT Devsite for documentation and guides.
Special thanks to the Google ODML team and the MediaTek team for their significant contributions to this effort:
Google ODML team: Alice Zheng, Advait Jain, Andrew Zhang, Arian Arfaian, Chintan Parikh, Chunlei Niu, Cormac Brick, Gerardo Carranza, Gregory Karpiak, Jingjiang Li, Jing Jin, Julius Kammerl, Lu Wang, Luke Boyer, Marissa Ikonomidis, Maria Lyubimtseva, Matt Kreileder, Matthias Grundmann, Na Li, Ping Yu, Quentin Khan, Rishika Sinha, Sachin Kotwani, Sebastian Schmidt, Steven Toribio, Teng-Hui Zhu, Terry (Woncheol) Heo, Vitalii Dziuba, Weiyi Wang, Yu-Hui Chen, Zichuan Wei.
MediaTek team: Bo-Yan Lin, Chao-Yuan Lee, Cheng-Yen Lin, Chia-Lin Yu, Chiayu Sung, Christoph Kuo, Chuo-Ling Chang, Deep Yap, Hsienkai Kuo, HungChun Liu, Jush Lu, Kayden Yang, Lei Chen, Peng-Wen Chen, Poyuan Jeng, Tzu-hsuan Wei, Waimun Wong, Wen-Li Shih, YanRen Chang, Yi-Min Tsai, Yu-Chieh Lin, Yu-Ting Wan.