Date: 2024-09-05

PaliGemma adds a vision model, consisting of an image encoder, to the base Gemma model. The encoder's output, along with the text tokens, is passed to a specialized Gemma 2B model. The vision model and the Gemma model are trained in several stages, both independently and together, to produce the final joint architecture. For full details, see Section 3.2 of the PaLI-3 paper.

PaliGemma is a lightweight open vision-language model (VLM) inspired by PaLI-3, and based on open components like the SigLIP vision model and the Gemma language model. Pali stands for Pathway Language and Image Model. As the name implies, this model is able to take both image and text inputs and produce a text response, as you can see in this fine-tuning guide.

In the previous post in the Gemma explained series, you reviewed the RecurrentGemma architecture. In this blog post, you will explore the PaliGemma architecture. Let’s dive into it!

vision_tower (SiglipVisionModel)

This component is responsible for processing the input image.

It uses SiglipVisionTransformer, a transformer architecture designed for vision tasks.



embeddings (SiglipVisionEmbeddings)

PaliGemma takes as input one or more images, which are turned into “soft tokens” by the SigLIP encoder.

The encoder breaks the image into smaller patches, similar to how a text model processes words in a sentence. The model then learns to capture relationships between these patches, effectively understanding the image’s visual content.



patch_embedding

It uses a convolutional layer (Conv2d) with the following parameters.

3: The input has 3 channels (for RGB images)

1152: The output has 1152 channels, which is the embedding dimension of each patch

kernel_size=(14, 14): Each patch is a 14x14 pixel square

stride=(14, 14): The patches are taken with no overlap (the convolutional filter moves 14 pixels at a time)

padding='valid': No padding is applied, so the output size will be smaller than the input size.
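
As a rough check on these parameters, the sketch below works out how many patches the convolution produces. It assumes a 224x224 input (the "224" in checkpoint names such as paligemma-3b-mix-224); the helper name conv_output_size is made up for this illustration.

```python
def conv_output_size(input_size: int, kernel_size: int, stride: int) -> int:
    """Output length along one axis for a convolution with 'valid' (no) padding."""
    return (input_size - kernel_size) // stride + 1


IMAGE_SIZE = 224   # assumed input resolution, not fixed by the layer itself
PATCH_SIZE = 14    # kernel_size and stride listed above
EMBED_DIM = 1152   # output channels listed above

patches_per_side = conv_output_size(IMAGE_SIZE, PATCH_SIZE, PATCH_SIZE)
num_patches = patches_per_side ** 2

print(patches_per_side)          # 16
print(num_patches)               # 256
print((num_patches, EMBED_DIM))  # (256, 1152): one 1152-dim vector per patch
```

Because the stride equals the kernel size, every pixel belongs to exactly one patch, and a 224x224 image becomes a 16x16 grid of 256 patch embeddings.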



position_embedding

Position embeddings are added to each patch embedding to encode the spatial information (i.e., where each patch was located in the original image).

This is done using a learned embedding layer (Embedding) that takes as input the position of each patch (up to 256 positions) and outputs a vector of size 1152 (the same as the patch embedding dimension).
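
A minimal sketch of that lookup-and-add step follows. It uses a toy width of 4 instead of 1152, random values in place of learned weights, and assumes patches are numbered in row-major order over a 16x16 grid; all names are hypothetical.

```python
import random

GRID = 16                   # 16 x 16 patch grid, assuming a 224 x 224 image
NUM_POSITIONS = GRID * GRID  # 256 positions
DIM = 4                      # toy width for readability; the real layer uses 1152

random.seed(0)
# Stand-in for the learned table: one vector per position.
# Random here; in the real model these values are trained.
position_table = [
    [random.gauss(0.0, 0.02) for _ in range(DIM)] for _ in range(NUM_POSITIONS)
]


def position_id(row: int, col: int, grid: int = GRID) -> int:
    """Flatten a (row, col) patch location into one index (row-major assumption)."""
    return row * grid + col


def add_position(patch_embeddings):
    """Add the position vector for index i to the i-th patch embedding."""
    return [
        [p + q for p, q in zip(patch, position_table[i])]
        for i, patch in enumerate(patch_embeddings)
    ]


patches = [[0.0] * DIM for _ in range(NUM_POSITIONS)]  # dummy patch embeddings
encoded = add_position(patches)
print(position_id(0, 0), position_id(15, 15))  # 0 255
```

Since the position vector has the same width as the patch embedding, the two are simply summed element by element, and the sequence length stays at 256.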



encoder (SiglipEncoder)

The embeddings pass through a series of SiglipEncoderLayer blocks, each consisting of a self-attention layer and a feed-forward network. This helps the model capture relationships between different parts of the image.



multi_modal_projector (PaliGemmaMultiModalProjector)

This component projects the output of the vision tower into a multi-modal space. It does so with a single linear layer, which maps the image tokens to the width the language model expects so that the vision and language representations can be combined.
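
To illustrate what a linear projection does to the image tokens, here is a small sketch in plain Python. The widths are toy values (6 standing in for 1152, 4 standing in for the language model's hidden width), the weights are random rather than trained, and the function name linear is only for this example.

```python
import random


def linear(x, weight, bias):
    """Compute y = W x + b for one vector x; weight has shape (out_dim, in_dim)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


VISION_DIM = 6  # toy stand-in for 1152
TEXT_DIM = 4    # toy stand-in for the language model width

random.seed(0)
weight = [
    [random.uniform(-0.1, 0.1) for _ in range(VISION_DIM)] for _ in range(TEXT_DIM)
]
bias = [0.0] * TEXT_DIM

image_tokens = [[1.0] * VISION_DIM for _ in range(3)]  # 3 dummy vision-tower outputs
projected = [linear(token, weight, bias) for token in image_tokens]
print(len(projected), len(projected[0]))  # 3 4
```

The number of image tokens is unchanged; only the width of each token changes, so the projected tokens can sit alongside the text token embeddings.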



language_model (GemmaForCausalLM)

This component is a language model based on the Gemma 2B model.

It takes as input the multi-modal representation from the projector and generates text output.

For the text input, each checkpoint was trained with various sequence lengths. For example, paligemma-3b-mix-224 was trained with sequence length 256 (input text + output text tokenized by Gemma’s tokenizer).

PaliGemma uses the Gemma tokenizer with 256000 tokens, but extends its vocabulary with 1024 entries that represent coordinates in normalized image-space (<loc0000>...<loc1023>), and another 128 entries (<seg000>...<seg127>) that are codewords used by a lightweight referring-expression segmentation vector-quantized variational auto-encoder (VQ-VAE). This gives a total vocabulary of 256000 + 1024 + 128 = 257216 tokens.



Object Segmentation Example

Additional soft tokens encode object detection and image segmentation results. Below is an example output from paligemma-3b-mix-224. You can try it yourself in the Hugging Face live demo.
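
To make the location tokens more concrete, the sketch below turns a detection-style string into a pixel bounding box. It rests on several assumptions that are not stated in this post: that a detection is written as four location tokens followed by a label, that the tokens are ordered y_min, x_min, y_max, x_max, and that dividing a token value by 1024 gives the normalized coordinate. The input string is made up for illustration and is not a real model output.

```python
import re

LOC_PATTERN = re.compile(r"<loc(\d{4})>")


def parse_detection(text: str, image_width: int, image_height: int):
    """Parse '<loc....><loc....><loc....><loc....> label' into (label, pixel box).

    Assumes the token order y_min, x_min, y_max, x_max and that each value
    divided by 1024 is a coordinate normalized to the image size.
    """
    values = [int(v) for v in LOC_PATTERN.findall(text)]
    if len(values) != 4:
        raise ValueError(f"expected 4 location tokens, found {len(values)}")
    label = LOC_PATTERN.sub("", text).strip()
    y_min, x_min, y_max, x_max = (v / 1024 for v in values)
    # Return the box as (left, top, right, bottom) in pixels.
    box = (
        round(x_min * image_width),
        round(y_min * image_height),
        round(x_max * image_width),
        round(y_max * image_height),
    )
    return label, box


# Made-up string for illustration only; not taken from a real model run.
label, box = parse_detection("<loc0256><loc0128><loc0768><loc0896> cat", 640, 480)
print(label, box)  # cat (80, 120, 560, 360)
```

If the token order or normalization differs for a given checkpoint, only the unpacking line and the divisor need to change.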