Introducing PaliGemma 2: Powerful Vision-Language Models, Simple Fine-Tuning

DEC. 5, 2024

Daniel Keysers, Research Engineer

Andreas Steiner, Staff Software Engineer

Building custom, advanced AI that can "see" used to be a complex and resource-intensive endeavor. Not anymore. In May 2024, we launched PaliGemma, the first vision-language model in the Gemma family, taking a significant step toward making class-leading visual AI more accessible. Now, we're thrilled to introduce PaliGemma 2, the next evolution in tunable vision-language models.

PaliGemma 2 builds upon the performant Gemma 2 models, adding the power of vision and making it easier than ever to fine-tune for exceptional performance. With PaliGemma 2, these models can see, understand, and interact with visual input, opening up a world of new possibilities.

What’s new in PaliGemma 2?

Scalable performance: Optimize performance for any task with PaliGemma 2's multiple model sizes (3B, 10B, 28B parameters) and resolutions (224px, 448px, 896px).

Long captioning: PaliGemma 2 generates detailed, contextually relevant captions for images, going beyond simple object identification to describe actions, emotions, and the overall narrative of the scene.

Expanding to new horizons: Our research demonstrates leading performance on chemical formula recognition, music score recognition, spatial reasoning, and chest X-ray report generation, as detailed in the technical report.

Upgrading to PaliGemma 2 is a breeze for existing PaliGemma users. It's designed as a drop-in replacement, offering a range of model sizes with immediate performance gains on most tasks without major code modifications. Additionally, its flexibility makes fine-tuning for specific tasks and datasets straightforward, empowering you to tailor its capabilities to your precise needs.

You can learn more about how PaliGemma 2 works, including when to use more parameters and larger resolutions, in our technical report.
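The three model sizes and three input resolutions listed above combine into a grid of nine checkpoints. As a rough sketch, the snippet below enumerates that grid. The `google/paligemma2-<size>-pt-<resolution>` naming pattern and the `pt` (pre-trained) variant tag are assumptions based on how the checkpoints appear to be named on Hugging Face, so check the model hub for the exact identifiers before relying on them.

```python
from itertools import product

# Sizes and resolutions as listed in this post.
MODEL_SIZES = ("3b", "10b", "28b")
RESOLUTIONS_PX = (224, 448, 896)


def checkpoint_ids(prefix: str = "google/paligemma2", variant: str = "pt") -> list[str]:
    """Return one ID per size/resolution pair.

    The "<prefix>-<size>-<variant>-<resolution>" pattern is an assumption,
    not something this post specifies.
    """
    return [
        f"{prefix}-{size}-{variant}-{res}"
        for size, res in product(MODEL_SIZES, RESOLUTIONS_PX)
    ]


if __name__ == "__main__":
    for checkpoint in checkpoint_ids():
        print(checkpoint)
```

A common way to use such a grid is to start with the smallest model at the lowest resolution and move up only if the task needs it; the technical report discusses when larger sizes and resolutions help.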

Building on the success of PaliGemma

Since its launch, the Gemma family has rapidly grown into a vibrant ecosystem, the Gemmaverse, with tens of thousands of models and applications. This rapid growth is a testament to the community's ingenuity. Early innovations using PaliGemma, such as ColPali's advancements in visual document retrieval, Roboflow's fine-tuning techniques, and progress in real-time object tracking, demonstrate the expanding potential of the Gemmaverse.

Get started today

Ready to explore the potential of PaliGemma 2? Here's how:

Download models & code: Find the pre-trained models and code on Hugging Face and Kaggle.

Learn & integrate: Dive into our comprehensive documentation and example notebooks to quickly integrate these powerful tools into your projects. For PaliGemma, start with our inference notebook, then try fine-tuning with a custom dataset.

Use your preferred framework: Leverage your preferred tools and frameworks, including Hugging Face Transformers, Keras, PyTorch, JAX, and Gemma.cpp.
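For readers taking the Hugging Face Transformers route, a minimal captioning sketch might look like the following. It is an illustration, not the official notebook: the model ID, the `<image>caption en` prompt, and the generation settings are all assumptions, and `new_tokens` and `caption_image` are hypothetical helper names. Downloading the weights may also require accepting the model license on Hugging Face first.

```python
# Assumed prompt format: an image placeholder followed by a task prefix.
PROMPT = "<image>caption en"


def new_tokens(output_ids, prompt_length):
    """Drop the prompt tokens that generate() echoes before the new ones."""
    return output_ids[prompt_length:]


def caption_image(image_path, model_id="google/paligemma2-3b-pt-224", max_new_tokens=50):
    """Caption one image file; the default model_id is an assumption."""
    # Heavy imports stay inside the function so the helper above
    # can be used without torch or transformers installed.
    import torch
    from PIL import Image
    from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

    processor = AutoProcessor.from_pretrained(model_id)
    model = PaliGemmaForConditionalGeneration.from_pretrained(model_id).eval()

    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=PROMPT, images=image, return_tensors="pt")

    # Greedy decoding keeps the sketch deterministic.
    with torch.inference_mode():
        output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)

    prompt_length = inputs["input_ids"].shape[-1]
    return processor.decode(new_tokens(output[0], prompt_length), skip_special_tokens=True)
```

Calling `caption_image("photo.jpg")` (with any local image path) would return the generated caption as a string. Swapping in a larger or higher-resolution checkpoint should only require changing `model_id`.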

We're incredibly excited to see what you create with PaliGemma 2. Join the vibrant Gemma community, share your projects with the Gemmaverse, and let's continue to explore the boundless potential of AI together. Your feedback and contributions are invaluable in shaping the future of these models and driving innovation in the field.