A journey from a Gemma 3 decoder to a powerful text embedder

Date: 2025-09-29

How Embeddings Are Formed

You can use EmbeddingGemma to generate embeddings with frameworks such as Sentence Transformers. Given an input sequence of text, EmbeddingGemma processes it through a series of carefully designed steps to produce a concise vector representation.
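As a rough illustration of the Sentence Transformers route, the sketch below loads a model and ranks a few documents against a query. The model ID `google/embeddinggemma-300m` is an assumption (it is not stated in this article), and `load_embedder` and `rank_documents` are hypothetical helper names. Loading downloads weights, needs network access, and may require accepting the model license first, so the usage lines are left commented out.

```python
def load_embedder(model_id="google/embeddinggemma-300m"):
    """Load an embedding model through Sentence Transformers.

    The default model ID is an assumption. The import is done lazily so that
    the ranking helper below can be used without the library installed.
    """
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer(model_id)


def rank_documents(model, query, documents):
    """Return (score, document) pairs sorted by descending dot-product similarity.

    `model` is any object with an `encode(list_of_str)` method that returns
    one vector per input, as `SentenceTransformer.encode` does.
    """
    vectors = model.encode([query] + list(documents))
    query_vec, doc_vecs = vectors[0], vectors[1:]
    scored = []
    for doc, vec in zip(documents, doc_vecs):
        score = sum(float(q) * float(d) for q, d in zip(query_vec, vec))
        scored.append((score, doc))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Usage (commented out because it downloads model weights):
# model = load_embedder()
# print(model)  # shows the module list of the loaded model
# docs = ["Mars is often called the Red Planet.", "Venus is the hottest planet."]
# for score, doc in rank_documents(model, "Which planet is the Red Planet?", docs):
#     print(f"{score:.3f}  {doc}")
```

Because the final embeddings are normalized to unit length, the dot product used here equals cosine similarity.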

SentenceTransformer(
  (0): Transformer({'max_seq_length': 2048, 'do_lower_case': False, 'architecture': 'Gemma3TextModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Dense({'in_features': 768, 'out_features': 3072, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
  (3): Dense({'in_features': 3072, 'out_features': 768, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
  (4): Normalize()
)

(0): Transformer
An input sequence passes through this encoder-only transformer model. The transformer uses bidirectional attention to understand the meaning of each token in the provided context, producing a sequence of 768-dimensional vectors, one for each token in your input sequence.

(1): Pooling
The output of the transformer is a sequence of token embeddings. The pooling layer’s job is to convert this variable-length sequence into a single, fixed-size embedding for the entire input. EmbeddingGemma uses a strategy called “Mean Pooling”, the most common approach, which averages all token embeddings.

(2): Dense
Next, a linear projection scales the 768-dimensional embedding up to a larger dimension (3072).

(3): Dense
Then another linear projection scales the 3072-dimensional embedding back down to the final target dimension (768).

(4): Normalize
Finally, Euclidean normalization is applied, enabling efficient similarity comparisons. This is a simpler and cheaper operation than the RMSNorm that you might recall from other Gemma models.
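Steps (1) through (4) can be traced with a toy re-implementation. The sketch below is illustrative only: it uses pure Python, random weights, and tiny dimensions (8 and 32 standing in for 768 and 3072), and it starts from made-up token vectors rather than real transformer output. All function names are hypothetical and not part of any library.

```python
import math
import random


def mean_pool(token_vectors):
    """(1) Average the per-token vectors into one fixed-size vector."""
    count = len(token_vectors)
    return [sum(column) / count for column in zip(*token_vectors)]


def dense(vector, weights):
    """(2)/(3) Bias-free linear projection with identity activation."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def l2_normalize(vector):
    """(4) Scale the vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def random_matrix(rows, cols, rng):
    """Random stand-in for learned weights or transformer output."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def embed(token_vectors, w_up, w_down):
    """Pool, project up, project down, then normalize."""
    pooled = mean_pool(token_vectors)
    return l2_normalize(dense(dense(pooled, w_up), w_down))


rng = random.Random(0)
HIDDEN, EXPANDED = 8, 32                     # toy stand-ins for 768 and 3072
tokens = random_matrix(5, HIDDEN, rng)       # pretend output for 5 tokens
w_up = random_matrix(EXPANDED, HIDDEN, rng)  # maps HIDDEN -> EXPANDED
w_down = random_matrix(HIDDEN, EXPANDED, rng)  # maps EXPANDED -> HIDDEN

embedding = embed(tokens, w_up, w_down)
norm = math.sqrt(sum(x * x for x in embedding))
print(len(embedding), round(norm, 6))
```

Whatever the number of input tokens, the result has the same length, and its unit norm means a plain dot product between two embeddings equals their cosine similarity.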

a visualization of EmbeddingGemma, how final embeddings are formed from input text

How EmbeddingGemma Learns

EmbeddingGemma learns to create its powerful embeddings by optimizing a combination of three distinct, weighted loss functions during its training.

1. Noise-Contrastive Estimation (NCE) Loss

The NCE loss teaches the model the fundamental concepts of similarity and contrast. For each input (e.g., a query), it learns to:

Pull "positive pairs" closer together: The model is trained to minimize the distance between the query and its correct answer in the embedding space.

Push "negative pairs" further apart: Simultaneously, it maximizes the distance between the query and incorrect answers from the same training batch.

The key is the inclusion of "hard negatives" (answers that are semantically similar to the query but are incorrect or incomplete). By training on these tricky examples, the model is forced to learn the subtle, fine-grained distinctions that separate correct answers from nearly correct ones. It’s like building a well-organized library, where related items are placed near each other, while unrelated items are kept distant.

2. Global Orthogonal Regularizer (GOR)

This loss is designed to encourage EmbeddingGemma to produce embeddings that are spread out over the embedding space. Even if the model learned to separate similar and dissimilar things, it might still get lazy and just stack all embeddings in the same small corner. This regularizer makes embeddings robust to quantization and enables efficient search in vector databases using Approximate Nearest Neighbor (ANN) algorithms.

3. Geometric Embedding Distillation

This loss serves as a form of knowledge distillation, where EmbeddingGemma learns from a larger, more powerful Gemini Embedding model as a teacher. The loss minimizes the L2 distance (a measure of difference) between the two embedding models’ embeddings for queries and passages. This enables EmbeddingGemma to learn from the teacher model, effectively inheriting much of its knowledge and capabilities.
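To make the three terms concrete, here is a minimal pure-Python sketch. It is not the training code: the contrastive term is written as an in-batch softmax cross-entropy with a made-up temperature, the spread-out term is a simplified stand-in for GOR (mean squared dot product between distinct pairs), the distillation term assumes student and teacher embeddings share a dimension, and the loss weights are placeholders. All function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def contrastive_loss(queries, positives, hard_negatives=None, temperature=0.05):
    """In-batch softmax cross-entropy: positives[i] is the answer for queries[i].

    Every other positive in the batch, plus any vectors in hard_negatives[i],
    serves as a negative for queries[i].
    """
    total = 0.0
    for i, query in enumerate(queries):
        candidates = list(positives)
        if hard_negatives is not None:
            candidates += hard_negatives[i]
        logits = [dot(query, c) / temperature for c in candidates]
        top = max(logits)  # subtract the max before exp() for numerical stability
        log_partition = top + math.log(sum(math.exp(z - top) for z in logits))
        total += log_partition - logits[i]
    return total / len(queries)


def spread_out_regularizer(embeddings):
    """Mean squared dot product over distinct pairs (0 when all are orthogonal)."""
    n = len(embeddings)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(dot(embeddings[i], embeddings[j]) ** 2 for i, j in pairs) / len(pairs)


def distillation_loss(student, teacher):
    """Mean Euclidean (L2) distance between matching student/teacher embeddings."""
    return sum(math.dist(s, t) for s, t in zip(student, teacher)) / len(student)


def combined_loss(queries, positives, teacher_queries, teacher_positives,
                  weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three terms; the weights here are placeholders."""
    w_nce, w_gor, w_distill = weights
    nce = contrastive_loss(queries, positives)
    gor = spread_out_regularizer(queries) + spread_out_regularizer(positives)
    distill = (distillation_loss(queries, teacher_queries)
               + distillation_loss(positives, teacher_positives))
    return w_nce * nce + w_gor * gor + w_distill * distill


# Toy batch of two unit-length query/answer pairs; the "teacher" simply
# reuses the student vectors, so the distillation term is zero here.
queries = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.8, 0.6], [0.6, 0.8]]
print(combined_loss(queries, positives, queries, positives))
```

Passing a hard negative that sits close to the query raises the contrastive term, which is the pressure that forces the model to separate correct answers from nearly correct ones.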

a visualization of three different loss functions (NCE Loss, GOR, and Distillation Loss) that EmbeddingGemma uses for training

By combining these three loss functions, EmbeddingGemma learns to produce representations that are well-structured, expressive, and robust, enabling strong performance in real-world search and retrieval tasks.

Matryoshka Representation Learning (MRL)

MRL is a technique that allows nesting smaller, high-quality representations within a larger one. For example, even though EmbeddingGemma’s embeddings have 768 dimensions, you can truncate the embeddings and get smaller ones with 512, 256, or even 128 dimensions which retain high quality.

During training, the loss functions are not just applied to the final 768-dimensional embedding, but also to the overlapping subsets of that embedding (the first 512, 256, and 128 dimensions). This ensures that even a truncated version of the full embedding is a powerful and complete representation.

For you, this means you can choose the right trade-off between performance and efficiency for your application without needing to train or manage multiple models. Simply select the embedding size that best fits your needs, ranging from the full 768 dimensions for maximum quality to smaller sizes for increased speed and lower storage costs.
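Both sides of MRL, truncating at inference time and applying a loss to several prefix lengths at training time, can be sketched in a few lines. This is a toy illustration with hypothetical function names. Rescaling the truncated vector to unit length and weighting every prefix equally are assumptions, not details given in this article, and the cosine-distance base loss merely stands in for the real training losses.

```python
import math


def truncate_embedding(embedding, dim):
    """Keep the first `dim` values and rescale to unit length.

    The rescaling step is an assumption: a prefix of a unit vector is
    generally shorter than 1, so it is renormalized before comparison.
    """
    if not 0 < dim <= len(embedding):
        raise ValueError("dim must be between 1 and len(embedding)")
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]


def matryoshka_loss(base_loss, queries, positives, dims):
    """Sum `base_loss` over each prefix length in `dims` (equal weights assumed)."""
    total = 0.0
    for dim in dims:
        q = [truncate_embedding(v, dim) for v in queries]
        p = [truncate_embedding(v, dim) for v in positives]
        total += base_loss(q, p)
    return total


def mean_cosine_distance(queries, positives):
    """Toy base loss: 1 - cosine similarity, averaged over aligned unit-vector pairs."""
    sims = [sum(a * b for a, b in zip(q, p)) for q, p in zip(queries, positives)]
    return sum(1.0 - s for s in sims) / len(sims)


full = [0.5, 0.5, 0.5, 0.5]          # toy 4-dimensional unit vector
small = truncate_embedding(full, 2)  # nested 2-dimensional embedding
print(small)
print(matryoshka_loss(mean_cosine_distance, [full], [full], dims=(4, 2)))
```

In practice you may not need a helper like this: Sentence Transformers accepts a `truncate_dim` argument on the `SentenceTransformer` constructor, which shortens the embeddings that `encode` returns.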

a visualization of Matryoshka Embeddings, showing how varying dimensionality influences the trade-off between quality and efficiency