Introducing Gemma 3n: Developer's Guide

The first Gemma model launched early last year and has since grown into a thriving Gemmaverse with over 160 million collective downloads. This ecosystem includes our family of a dozen specialized models for everything from security to medical applications and, most inspiringly, countless community-driven innovations. From innovators like Roboflow building enterprise computer vision to the Institute of Science Tokyo creating powerful Japanese Gemma variants, your work has shown us the way forward.

Building on this incredible momentum, we're excited to announce the full release of Gemma 3n. While last month's preview offered a first look, today unlocks the full power of this mobile-first architecture. Gemma 3n is built for the developer community that helped shape Gemma. It's supported by your favorite tools, including Hugging Face Transformers, llama.cpp, Google AI Edge, Ollama, MLX, and more, making it easy to fine-tune and deploy for your specific on-device applications. This post is a deep dive for developers: we'll look at some of the innovations behind Gemma 3n, share new benchmark results, and show you how to start building today.


What's new in Gemma 3n?

Gemma 3n represents a significant advancement in on-device AI, bringing powerful multimodal capabilities to edge devices with performance previously only seen in last year's pioneering cloud-based models.

Achieving this leap in on-device performance required rethinking the model from the ground up. The foundation is Gemma 3n's unique mobile-first architecture, and it all starts with MatFormer.

MatFormer: One model, many sizes

At the heart of Gemma 3n is the MatFormer (🪆 Matryoshka Transformer) architecture, a novel nested transformer built for elastic inference. Think of it like Matryoshka dolls: the larger model contains smaller, fully functional versions of itself. This approach extends the concept of Matryoshka Representation Learning from embeddings alone to all transformer components.

During MatFormer training of the 4B effective parameter (E4B) model, a 2B effective parameter (E2B) sub-model is simultaneously optimized within it, as shown in the figure above. This gives developers two powerful capabilities and use cases today:

1: Pre-extracted models: You can directly download and use either the main E4B model for the highest capability, or the standalone E2B sub-model that we have already extracted for you, offering up to 2x faster inference.

2: Custom sizes with Mix-n-Match: For more granular control tailored to specific hardware constraints, you can create a spectrum of custom-sized models between E2B and E4B using a method we call Mix-n-Match. This technique lets you precisely slice the E4B model's parameters, primarily by adjusting the feed-forward network hidden dimension per layer (from 8192 to 16384) and by selectively skipping certain layers (see the sketch below the figure). We are also sharing the MatFormer Lab, a tool that shows how to retrieve these optimal model configurations, identified by evaluating a range of settings on benchmarks such as MMLU.

MMLU results for pre-trained Gemma 3n checkpoints for various model sizes (using Mix-n-Match)
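To make the Mix-n-Match idea concrete, here is a minimal, purely illustrative Python sketch of assembling a custom-sized sub-model from an E4B-style layer stack by slicing each layer's feed-forward hidden dimension and skipping selected layers. The helper functions and the dictionary-based layer format are hypothetical and are not the MatFormer Lab API.

import numpy as np

def slice_ffn(w_in, w_out, hidden_dim):
    # Keep only the first `hidden_dim` feed-forward units of one layer.
    # In a real MatFormer checkpoint the retained units follow the nested training order.
    return w_in[:hidden_dim, :], w_out[:, :hidden_dim]

def build_submodel(e4b_layers, ffn_hidden_dims, skipped_layers=frozenset()):
    # Assemble a custom-sized sub-model between E2B and E4B from the E4B layer stack.
    sub_layers = []
    for idx, layer in enumerate(e4b_layers):
        if idx in skipped_layers:
            continue  # knob 2: selectively skip whole layers
        w_in, w_out = slice_ffn(layer["ffn_in"], layer["ffn_out"], ffn_hidden_dims[idx])  # knob 1
        sub_layers.append({**layer, "ffn_in": w_in, "ffn_out": w_out})
    return sub_layers

# Toy example: keep 12288 of 16384 FFN units in every layer of a 4-layer stack and skip layer 3.
rng = np.random.default_rng(0)
toy_stack = [{"ffn_in": rng.standard_normal((16384, 64)), "ffn_out": rng.standard_normal((64, 16384))}
             for _ in range(4)]
custom = build_submodel(toy_stack, ffn_hidden_dims=[12288] * 4, skipped_layers={3})

The MatFormer Lab automates the search over these two knobs against benchmarks; the sketch only illustrates what a chosen configuration would slice.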

Looking to the future, the MatFormer architecture also paves the way for elastic execution. While not part of today's implementations, this capability would allow a single deployed E4B model to dynamically switch between E4B and E2B inference paths on the fly, enabling real-time optimization of performance and memory usage based on the current task and device load.

Per-Layer Embeddings (PLE): Unlocking greater memory efficiency

Gemma 3n models include Per-Layer Embeddings (PLE). This innovation is tailored for on-device deployment because it dramatically improves model quality without increasing the amount of fast memory required on your device's accelerator (GPU/TPU).

While the Gemma 3n E2B and E4B models have total parameter counts of 5B and 8B respectively, PLE allows a significant portion of these parameters (the embeddings associated with each layer) to be loaded and computed efficiently on the CPU. This means only the core transformer weights (approximately 2B for E2B and 4B for E4B) need to sit in the typically more constrained accelerator memory (VRAM).

Per-Layer Embeddings

Thanks to Per-Layer Embeddings, you can run Gemma 3n E2B with only ~2B parameters loaded in your accelerator.
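As a rough back-of-the-envelope check of what this means for accelerator memory (the bytes-per-parameter figures below are assumed values for common precisions, not measured Gemma 3n numbers, and the estimate covers weights only, not the KV cache or activations):

def accel_weight_footprint_gib(core_params_billion, bytes_per_param):
    # Approximate accelerator memory needed just for the core transformer weights.
    return core_params_billion * 1e9 * bytes_per_param / (1024 ** 3)

# E2B: ~5B total parameters, of which ~2B core transformer weights must sit in VRAM;
# the remaining per-layer embeddings can be loaded and computed on the CPU.
for label, core_b in [("E2B core (~2B)", 2.0), ("E4B core (~4B)", 4.0)]:
    for precision, nbytes in [("bf16", 2), ("int4 (assumed)", 0.5)]:
        print(f"{label} @ {precision}: ~{accel_weight_footprint_gib(core_b, nbytes):.1f} GiB")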

KV cache sharing: faster long context processing

Processing long inputs, such as the sequences produced by audio and video streams, is essential for many advanced multimodal applications on device. Gemma 3n introduces KV Cache Sharing, a feature designed to significantly speed up time-to-first-token for streaming-response applications.

KV Cache Sharing optimizes how the model handles the initial input processing stage (often called the “prefill” phase). The keys and values of the middle layer from local and global attention are shared directly with all the top layers, delivering a notable 2x improvement in prefill performance compared to Gemma 3 4B. This means the model can ingest and understand long prompt sequences much faster than before.
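The toy sketch below is not Gemma 3n's actual implementation; it is a simplified, single-head illustration of the idea that the keys and values produced at a middle layer can be reused by every layer above it during prefill, so the top layers skip their own key/value projections.

import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def toy_prefill(x, layers, share_from):
    # x: (seq_len, d_model); each layer holds 'wq', 'wk', 'wv' of shape (d_model, d_model).
    # Layers above `share_from` reuse that middle layer's keys/values.
    d = x.shape[-1]
    shared_k = shared_v = None
    for idx, layer in enumerate(layers):
        q = x @ layer["wq"]
        if idx <= share_from:
            k, v = x @ layer["wk"], x @ layer["wv"]
            if idx == share_from:
                shared_k, shared_v = k, v      # cache the middle layer's K/V once
        else:
            k, v = shared_k, shared_v          # top layers skip their own K/V projections
        x = x + softmax(q @ k.T / np.sqrt(d)) @ v  # simplified attention + residual
    return x

rng = np.random.default_rng(0)
layers = [{name: 0.05 * rng.standard_normal((64, 64)) for name in ("wq", "wk", "wv")} for _ in range(8)]
_ = toy_prefill(rng.standard_normal((16, 64)), layers, share_from=3)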

Audio understanding: introducing speech to text and translation

Gemma 3n uses an advanced audio encoder based on the Universal Speech Model (USM). The encoder generates one token for every 160 ms of audio (roughly 6 tokens per second); these tokens are then fed as input to the language model, providing a fine-grained representation of the sound context.
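Because the token rate is fixed, you can estimate how many audio tokens a clip adds to your prompt; for example, a 30-second clip works out to roughly 187 tokens:

MS_PER_AUDIO_TOKEN = 160  # one audio token per 160 ms, i.e. ~6 tokens per second

def audio_token_count(duration_seconds):
    # Approximate number of tokens the USM-based encoder produces for a clip.
    return int(duration_seconds * 1000 // MS_PER_AUDIO_TOKEN)

print(audio_token_count(30))  # -> 187 tokens for a 30-second clip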

This integrated audio capability unlocks key features for developers building on device, including:

  • Automatic speech recognition (ASR): Enable high-quality speech-to-text transcription directly on your device.
  • Automatic Speech Translation (AST): Translate spoken language into text in another language.

We saw particularly strong AST results for translations between English and Spanish, French, Italian and Portuguese, which offers great potential for developers working on applications in these languages. For tasks such as speech translation, using chain-of-thought prompts can significantly improve results. Here is an example:

user
Transcribe the following speech segment in Spanish, then translate it into English: 

model

At launch, the Gemma 3n encoder is implemented to process audio clips up to 30 seconds long. However, this is not a fundamental limitation: the underlying audio encoder is a streaming encoder, capable of processing arbitrarily long audio with additional long-form audio training. Follow-up implementations will unlock low-latency, long-streaming applications.
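As a starting point, here is a minimal sketch of running the speech-translation prompt above through Hugging Face Transformers. The pipeline task name, the google/gemma-3n-E4B-it model id, the message format, and the local audio file name are assumptions based on how recent multimodal Gemma releases are exposed in Transformers; check the official model card for exact usage.

import torch
from transformers import pipeline

# Assumed task and model id; verify against the official model card.
pipe = pipeline(
    "image-text-to-text",
    model="google/gemma-3n-E4B-it",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

messages = [{
    "role": "user",
    "content": [
        {"type": "audio", "audio": "speech_clip_es.wav"},  # hypothetical local audio file
        {"type": "text", "text": "Transcribe the following speech segment in Spanish, "
                                 "then translate it into English:"},
    ],
}]

output = pipe(text=messages, max_new_tokens=128)
print(output[0]["generated_text"][-1]["content"])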


MobileNet-V5: New state-of-the-art vision encoder

In addition to the integrated audio capabilities, Gemma 3n features a new, highly efficient vision encoder, MobileNet-V5-300M, delivering state-of-the-art performance for multimodal tasks on edge devices.

Designed for flexibility and performance on limited hardware, MobileNet-V5 provides developers with:

  • Multiple input resolutions: Natively supports resolutions of 256 x 256, 512 x 512 and 768 x 768 pixels, allowing you to balance performance and detail for specific applications.
  • Broad visual understanding: Collaboratively trained on extensive multimodal datasets, it excels at a wide range of image and video understanding tasks.
  • High throughput: Processes up to 60 frames per second on a Google Pixel, enabling real-time, on-device video analysis and interactive experiences (a quick latency-budget check follows this list).
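As a quick sanity check when planning a real-time pipeline (plain arithmetic, not measured Gemma 3n numbers), the target frame rate sets the budget that the vision encoder and the rest of your pipeline must fit into per frame:

def per_frame_budget_ms(target_fps):
    # Average time available per frame if the whole pipeline keeps up with `target_fps`.
    return 1000.0 / target_fps

for fps in (30, 60):
    print(f"{fps} fps -> {per_frame_budget_ms(fps):.1f} ms per frame")
# At 60 fps, the vision encoder, LM prefill, and app logic together get ~16.7 ms per frame.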

This level of performance has been achieved through a number of architectural innovations, including:

  • Advanced foundation of MobileNet-V4 blocks (including Universal Inverted Bottlenecks and Mobile MQA).
  • Significantly enlarged architecture, featuring a hybrid deep pyramid model that is 10 times larger than the largest MobileNet-V4 variant.
  • Innovative Multi-Scale Fusion VLM adapter that improves token quality for greater accuracy and efficiency.

Using novel architectural designs and advanced distillation techniques, MobileNet-V5-300M significantly outperforms the baseline SoViT in Gemma 3 (trained with SigLip, no distillation). On a Google Pixel Edge TPU, it delivers a 13x speedup with quantization (6.5x without), requires 46% fewer parameters, and has a 4x smaller memory footprint, all while providing significantly higher accuracy on vision-language tasks.

We're excited to share more about the work on this model. Look out for our upcoming MobileNet-V5 whitepaper, which details the model architecture, data scaling strategies, and advanced distillation techniques.

Making Gemma 3n accessible from day one was a priority. We're proud to partner with many amazing open source developers to ensure broad support across popular tools and platforms, including contributions from the teams behind AMD, Axolotl, Docker, Hugging Face, llama.cpp, LMStudio, MLX, NVIDIA, Ollama, RedHat, SGLang, Unsloth, and vLLM.

But this ecosystem is just the beginning. The real power of this technology lies in what you build with it. That's why we're launching the Gemma 3n Impact Challenge. Your mission: leverage Gemma 3n's unique on-device, offline, and multimodal capabilities to build a product for a better world. With $150,000 in prizes, we're looking for a compelling video story and a “wow” demo that shows real-world impact. Join the challenge and help build a better future.

Start using Gemma 3n today

Ready to discover the potential of Gemma 3n today? Here's how:

  • Learn and integrate: Dive into our comprehensive documentation to quickly integrate Gemma into your projects, or get started with our inference and fine-tuning guides (a minimal local-runtime sketch follows below).
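If you want a quick local smoke test, here is a minimal sketch using the Ollama Python client. The gemma3n:e2b tag is an assumption based on Ollama's usual naming; check the Ollama model library for the exact tag before pulling.

# pip install ollama, then pull the model first (e.g. `ollama pull gemma3n:e2b`, tag assumed)
import ollama

response = ollama.chat(
    model="gemma3n:e2b",  # assumed tag; verify in the Ollama model library
    messages=[{"role": "user", "content": "In two sentences, what is the MatFormer architecture?"}],
)
print(response["message"]["content"])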
