Introducing PaliGemma 2 mix: A vision-language model for multiple tasks

FEB 19, 2025
Omar Sanseviero Staff Developer Relations Engineer
Andreas Steiner Staff Software Engineer

This past December, we launched PaliGemma 2, an upgraded vision-language model in the Gemma family. The release included pretrained checkpoints in three sizes (3B, 10B, and 28B parameters) that can be easily fine-tuned to reach high performance on a wide range of vision-language tasks and domains, such as image segmentation, short video captioning, scientific question answering, and text-related tasks.

Now, we’re thrilled to announce the launch of PaliGemma 2 mix checkpoints. These models are tuned on a mixture of tasks, so you can explore the model’s capabilities directly and use it out of the box for common use cases.


What’s new in PaliGemma 2 mix?

  • Multiple tasks with one model: PaliGemma 2 mix can solve tasks such as short and long captioning, optical character recognition (OCR), image question answering, object detection and segmentation.

  • Developer-friendly sizes: Use the best model for your needs thanks to the different model sizes (3B, 10B, and 28B parameters) and resolutions (224px and 448px).

If you were already using the original PaliGemma mix checkpoints, you can upgrade directly to PaliGemma 2 without making any changes. The model performs different tasks depending on how it’s prompted. You can review the prompt syntax for each task in the official documentation and learn more about how PaliGemma 2 was developed in our technical report.
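
Because the task is selected purely by the prompt, it can help to build prompt strings in one place. The helper below is a minimal sketch based only on the prompts shown in the examples in this post; the function name `build_prompt` and its parameters are hypothetical, and the official documentation remains the reference for the full syntax.

```python
def build_prompt(task, *, objects=None, question=None, lang="en"):
    """Return a prompt string in the style used by the examples in this post.

    Hypothetical helper: it only covers the prompt shapes shown here
    (caption, ocr, answer, detect, segment).
    """
    if task == "caption":
        body = f"caption {lang}"
    elif task == "ocr":
        body = "ocr"
    elif task == "answer":
        if not question:
            raise ValueError("the 'answer' task needs a question")
        body = f"answer {lang} {question}"
    elif task == "detect":
        if not objects:
            raise ValueError("the 'detect' task needs at least one object name")
        # Several object names are joined with " ; ", as in "detect chair ; table".
        body = "detect " + " ; ".join(objects)
    elif task == "segment":
        if not objects or len(objects) != 1:
            raise ValueError("this sketch segments exactly one object name")
        body = f"segment {objects[0]}"
    else:
        raise ValueError(f"unknown task: {task!r}")
    # Every example prompt in this post ends with a newline.
    return body + "\n"


print(repr(build_prompt("detect", objects=["chair", "table"])))
print(repr(build_prompt("answer", question="where is the cow standing?")))
```

Keeping the newline and separator handling in one function avoids small formatting slips, such as a missing space around the semicolon, when switching between tasks.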


Detection

  • Task: Detection (PaliGemma-2-3b-mix-224)
  • Input: "detect android\n"

Result:

Result in PaliGemma 2 Mix: A large, green Android figure stands on a white platform, enclosed by a red box. The word "android" is written in red above the figure.

Multiple Object Detection

  • Task: Multiple Object Detection (PaliGemma-2-3b-mix-224)
  • Input: "detect chair ; table\n"
Multiple object detection of items in a dining room

Result:

A wooden table and chair are in the foreground. Additional tables and chairs can be seen in the background within a room with a bee patterned wall and wooden floors. Labeled boxes highlight the furniture with the text "table" and "chair."

  • Task: Multiple Object Detection (PaliGemma-2-3b-mix-224)
  • Input: "detect food ; plate ; bowl\n"
Plates and bowls of food on a wooden table

Result:

Plates and bowls of food on a wooden table labeled with boxes that accurately identify "plate", "bowl" and "food"
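
The boxes drawn in the results above come from text that the model generates. The sketch below shows one way to turn that text into pixel coordinates. It assumes the output format described in the official PaliGemma documentation rather than in this post: each detection is four `<locNNNN>` tokens in the order y_min, x_min, y_max, x_max, each a bin from 0 to 1023, followed by the label, with multiple detections separated by " ; ". The function name `parse_detections` and the sample string are made up for illustration.

```python
import re

# Assumed format: <locYMIN><locXMIN><locYMAX><locXMAX> label ; <loc...> label
_DETECTION = re.compile(
    r"<loc(\d{4})><loc(\d{4})><loc(\d{4})><loc(\d{4})>\s*([^;<]+)"
)


def parse_detections(text, image_width, image_height):
    """Convert detection text into labeled pixel boxes (x_min, y_min, x_max, y_max)."""
    boxes = []
    for match in _DETECTION.finditer(text):
        # Location bins are normalized by 1024 to get fractions of the image size.
        y_min, x_min, y_max, x_max = (int(v) / 1024 for v in match.groups()[:4])
        boxes.append(
            {
                "label": match.group(5).strip(),
                "box": (
                    x_min * image_width,
                    y_min * image_height,
                    x_max * image_width,
                    y_max * image_height,
                ),
            }
        )
    return boxes


# Made-up output string, not a real model response.
sample = "<loc0256><loc0128><loc0768><loc0512> chair ; <loc0000><loc0512><loc1023><loc1023> table"
for detection in parse_detections(sample, image_width=1024, image_height=512):
    print(detection["label"], detection["box"])
```

Scaling by the original image size, rather than the 224px or 448px model input size, keeps the boxes aligned with the image you actually display.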

Optical Character Recognition (OCR)

  • Task: Optical Character Recognition (PaliGemma-2-3b-mix-224)
  • Input: "ocr\n"
Lighting labels in Japanese kanji

Result:

Japanese kanji reads: Downlight, Dining Room, Kitchen, Living Room, Bathroom/Dressing Room

Segmentation

  • Task: Segmentation (PaliGemma-2-3b-mix-224) [Image generated by ImageFX]
  • Input: "segment cat\n"
Image of a cat looking at the camera behind a wooden sign that reads 'Hello PaliGemma 2' generated by ImageFX

Result:

Highlighted image of a cat looking at the camera behind a wooden sign that reads 'Hello PaliGemma 2' generated by ImageFX

Question Answering

  • Task: Question Answering (PaliGemma-2-3b-mix-448) [Image generated by ImageFX]
  • Input: "answer en where is the cow standing?\n"
A cow standing on the beach next to a yellow sign that reads 'Warning Dangerous Rip Current' with an illustration of a large wave breaking.

Result: beach


Captioning

  • Input: "caption en\n"
A cow standing on the beach next to a yellow sign that reads 'Warning Dangerous Rip Current' with an illustration of a large wave breaking.

Result: a cow standing on a beach next to a sign that says warning dangerous rip current.


Optical Character Recognition (OCR)

  • Input: "ocr\n"
A cow standing on the beach next to a yellow sign that reads 'Warning Dangerous Rip Current' with an illustration of a large wave breaking.

Result:

WARNING

DANGEROUS

RIP CURRENT


Detection

  • Input: "detect cow\n"
A cow standing on the beach next to a yellow sign that reads 'Warning Dangerous Rip Current' with an illustration of a large wave breaking.

Result:

A cow standing on the beach next to a yellow sign that reads 'Warning Dangerous Rip Current' with an illustration of a large wave breaking. A red box outlines the cow, with a label that reads "cow"

Segmentation

  • Input: "segment cow\n"
A cow standing on the beach next to a yellow sign that reads 'Warning Dangerous Rip Current' with an illustration of a large wave breaking.

Result:

A highlighted cow standing on the beach next to a yellow sign that reads 'Warning Dangerous Rip Current' with an illustration of a large wave breaking.

Captioning

  • Task: Captioning (PaliGemma-2-10b-mix-448)
  • Input: "caption en\n"
A cow standing on the beach next to a yellow sign that reads 'Warning Dangerous Rip Current' with an illustration of a large wave breaking.

Result: A cow standing on a beach next to a warning sign.

Optical Character Recognition (OCR)

  • Input: "ocr\n"
A cow standing on the beach next to a yellow sign that reads 'Warning Dangerous Rip Current' with an illustration of a large wave breaking.

Result:

WARNING DANGEROUS

RIP CURRENT


Get Started Today

Ready to discover the potential of PaliGemma 2? Here is how you can explore the mix model capabilities:

  • Try out the mix model with a few clicks: Explore the mix model capabilities directly on the Hugging Face demo.

  • Learn how to run the model: Try out the Keras inference notebook directly in Google Colab or locally.
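
If you prefer Hugging Face transformers over the Keras notebook, inference could look roughly like the sketch below. It is not the official example. The checkpoint ID `google/paligemma2-3b-mix-224` follows the naming used on the Hugging Face Hub and is an assumption here, as is the function name `run_paligemma`. Access to the weights typically requires accepting the model license on the Hub first. The sketch also assumes that the transformers processor appends the trailing newline itself, so it strips the one shown in the prompts above; check this against your installed version.

```python
def run_paligemma(image_path, prompt,
                  model_id="google/paligemma2-3b-mix-224",
                  max_new_tokens=64):
    """Run one prompt against one image and return the generated text.

    Hypothetical helper. Heavy imports live inside the function so that
    defining it does not require torch or transformers to be installed.
    """
    import torch
    from PIL import Image
    from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

    processor = AutoProcessor.from_pretrained(model_id)
    model = PaliGemmaForConditionalGeneration.from_pretrained(model_id).eval()

    image = Image.open(image_path).convert("RGB")
    # Assumption: the processor adds the "\n" separator, so drop ours.
    inputs = processor(text=prompt.rstrip("\n"), images=image, return_tensors="pt")
    inputs = inputs.to(model.device)

    with torch.inference_mode():
        output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)

    # generate() returns prompt tokens followed by new tokens; keep only the new ones.
    prompt_length = inputs["input_ids"].shape[-1]
    return processor.decode(output[0][prompt_length:], skip_special_tokens=True)


# Example call (downloads the checkpoint, so it is left commented out):
# print(run_paligemma("cow.jpg", "caption en\n"))
```

Greedy decoding (`do_sample=False`) keeps the output deterministic, which is convenient when comparing checkpoints of different sizes or resolutions on the same image.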


While PaliGemma 2 mix has strong performance across multiple tasks, you will get the best results by fine-tuning PaliGemma 2 on your own task or domain. To learn how, dive into our comprehensive documentation, check our official example notebooks for Keras and JAX, or use the Hugging Face transformers example. We’re looking forward to seeing what you build with it!