Introducing PaliGemma 2 mix: A vision-language model for multiple tasks

February 19, 2025

Omar Sanseviero, Staff Developer Relations Engineer

Andreas Steiner, Staff Software Engineer

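Last December, we released PaliGemma 2, an upgraded vision-language model in the Gemma family. The release included pretrained checkpoints in several sizes (3B, 10B, and 28B parameters) that can be easily fine-tuned on a wide range of vision-language tasks and domains, such as image segmentation, short video captioning, scientific question answering, and text-related tasks with high performance.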

Today, we're announcing the release of the new PaliGemma 2 mix checkpoints. PaliGemma 2 mix is a model tuned to a mixture of tasks, so you can explore its capabilities directly and use it out of the box for common use cases.

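What's new in PaliGemma 2 mix

One model, multiple tasks: PaliGemma 2 mix can solve tasks such as short and long captioning, optical character recognition (OCR), image question answering, object detection, and segmentation.

Developer-friendly sizes: The models come in several sizes (3B, 10B, and 28B parameters) and resolutions (224 px and 448 px), so you can pick the one that best fits your needs.

Use it with your preferred framework: Work with the tools and frameworks you already know, including Hugging Face Transformers, Keras, PyTorch, JAX, and Gemma.cpp.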
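The size and resolution options above form a small grid of six checkpoints. The sketch below enumerates that grid. The naming scheme `paligemma2-{size}-mix-{resolution}` is an assumption modeled on the identifiers used on Hugging Face; check the model hub you use for the exact names.

```python
from itertools import product

# Sizes and resolutions as listed in this post.
SIZES = ("3b", "10b", "28b")
RESOLUTIONS = (224, 448)


def checkpoint_name(size: str, resolution: int) -> str:
    """Build a checkpoint name under the assumed naming scheme."""
    if size not in SIZES:
        raise ValueError(f"unknown size: {size!r}")
    if resolution not in RESOLUTIONS:
        raise ValueError(f"unknown resolution: {resolution!r}")
    # Assumption: names follow "paligemma2-<size>-mix-<resolution>".
    return f"paligemma2-{size}-mix-{resolution}"


# Every size/resolution combination, smallest model first.
ALL_CHECKPOINTS = [checkpoint_name(s, r) for s, r in product(SIZES, RESOLUTIONS)]
```

A smaller, lower-resolution checkpoint is usually the cheaper starting point; the 448 px variants are worth trying when the task depends on fine detail such as small text.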

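If you already use the original PaliGemma mix checkpoints, you can upgrade directly to PaliGemma 2 without making any changes. The model performs different tasks depending on the prompt. See the official documentation for the syntax of each prompt task. For details on how PaliGemma 2 was developed, see the technical report.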
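Because the task is selected entirely by the prompt, it can help to build prompts with small helpers instead of typing them by hand. The sketch below covers only the prompt forms that appear in the examples in this post; the function names are hypothetical, and the official documentation remains the reference for the full syntax.

```python
# Prompt used for optical character recognition in this post.
OCR_PROMPT = "ocr\n"


def caption_prompt(lang: str = "en") -> str:
    """Captioning prompt, for example 'caption en' plus a newline."""
    return f"caption {lang}\n"


def answer_prompt(question: str, lang: str = "en") -> str:
    """Question answering prompt: 'answer <lang> <question>' plus a newline."""
    return f"answer {lang} {question}\n"


def detect_prompt(*labels: str) -> str:
    """Detection prompt; multiple labels are joined with ' ; '."""
    if not labels:
        raise ValueError("at least one label is required")
    return "detect " + " ; ".join(labels) + "\n"


def segment_prompt(label: str) -> str:
    """Segmentation prompt for a single label."""
    return f"segment {label}\n"
```

For example, `detect_prompt("food", "plate", "bowl")` produces the same string as the multi-object detection input shown later in this post.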

Detection

Task: Detection (PaliGemma-2-3b-mix-224)
Input: "detect android\n"


Result:

Result in PaliGemma 2 Mix: A large, green Android figure stands on a white platform, enclosed by a red box. The word "android" is written in red above the figure.

Multiple object detection

Task: Multiple object detection (PaliGemma-2-3b-mix-224)
Input: "detect chair ; table\n"

Multiple object detection of items in a dining room

Result:

A wooden table and chair are in the foreground. Additional tables and chairs can be seen in the background within a room with a bee patterned wall and wooden floors. Labeled boxes highlight the furniture with the text "table" and "chair."

Task: Multiple object detection (PaliGemma-2-3b-mix-224)
Input: "detect food ; plate ; bowl\n"

Plates and bowls of food on a wooden table

Result:

Plates and bowls of food on a wooden table labeled with boxes that accurately identify "plate", "bowl" and "food"
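The results above are shown as rendered boxes, but the model itself returns text. The sketch below converts such text into pixel boxes. It rests on an assumption not stated in this post: that each detection is emitted as four `<locNNNN>` tokens in the order y_min, x_min, y_max, x_max, with values normalized to a 1024-step grid, followed by the label, and that multiple detections are separated by " ; ". Verify this against the official documentation before relying on it. The function name is hypothetical.

```python
import re

# Assumed raw format: "<loc0256><loc0128><loc0768><loc0896> chair ; <loc...> table"
_DETECTION = re.compile(
    r"<loc(\d{4})><loc(\d{4})><loc(\d{4})><loc(\d{4})>\s*([^;<]+)"
)


def parse_detections(text: str, width: int, height: int) -> list[dict]:
    """Convert assumed detection output into pixel-space boxes.

    Returns a list of {"label": str, "box": (x_min, y_min, x_max, y_max)}.
    """
    results = []
    for match in _DETECTION.finditer(text):
        # Assumption: token order is y_min, x_min, y_max, x_max on a 0..1023 grid.
        y0, x0, y1, x1 = (int(match.group(i)) / 1024 for i in range(1, 5))
        results.append(
            {
                "label": match.group(5).strip(),
                "box": (x0 * width, y0 * height, x1 * width, y1 * height),
            }
        )
    return results
```

Scaling by the original image width and height, not the model's input resolution, keeps the boxes aligned with the image you actually draw on.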

Optical character recognition (OCR)

Task: Optical character recognition (PaliGemma-2-3b-mix-224)
Input: "ocr\n"

Result:

Japanese Kanji reads: Downlight, Dining Room, Kitchen, Living Room, Bathroom/Dressing Room

Segmentation

Task: Segmentation (PaliGemma-2-3b-mix-224) [image generated with ImageFX]
Input: "segment cat\n"

Image of a cat looking at the camera behind a wooden sign that reads 'Hello PaliGemma 2' generated by ImageFX

Result:

highlighted image of a cat looking at the camera behind a wooden sign that reads 'Hello PaliGemma 2' generated by ImageFX

Question answering

Task: Question answering (PaliGemma2-mix-3b-448) [image generated with ImageFX]
Input: "answer en where is the cow standing?\n"

A cow standing on the beach next to a yellow sign that reads 'Warning Dangerous Rip Current' with an illustration of a large wave breaking.

Result: beach

Captioning

Input: "caption en\n"

Result: a cow standing on a beach next to a sign that says warning dangerous rip current.

Optical character recognition (OCR)

Result:

WARNING

DANGEROUS

RIP CURRENT

Detection

Input: "detect cow\n"

Result:

A cow standing on the beach next to a yellow sign that reads 'Warning Dangerous Rip Current' with an illustration of a large wave breaking. A red box outlines the cow, with a label that reads "cow"

Segmentation

Input: "segment cow\n"

Result:

A highlighted cow standing on the beach next to a yellow sign that reads 'Warning Dangerous Rip Current' with an illustration of a large wave breaking.

Captioning

Task: Captioning (PaliGemma 2-mix-10b-448)
Input: "caption en\n"

Result: A cow standing on a beach next to a warning sign.

Optical character recognition (OCR)

Input: "ocr\n"

Result:

WARNING DANGEROUS

RIP CURRENT

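Get started today

If you want to explore what PaliGemma 2 can do, here are several ways to try the mix models.

Try the mix models in a few clicks: See the capabilities of the mix models directly in the Hugging Face demo.

Download the models: The mix model weights are available on Kaggle and Hugging Face.

Learn how to run the models: Try the Keras inference notebook directly in Google Colab or locally.

Deploy and tune in a few clicks: Use PaliGemma 2 mix directly in Vertex Model Garden.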
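For readers who prefer Hugging Face Transformers, the following is a minimal inference sketch, not the official sample. It assumes the `transformers`, `torch`, and `Pillow` packages are installed, that the checkpoint is published as `google/paligemma2-3b-mix-224`, and that you have accepted the model license on Hugging Face. The function name `run_paligemma` is hypothetical.

```python
def run_paligemma(image_path, prompt, model_id="google/paligemma2-3b-mix-224",
                  max_new_tokens=100):
    """Run one prompt against one image and return the decoded text."""
    # Heavy dependencies are imported lazily so this module loads without them.
    import torch
    from PIL import Image
    from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

    processor = AutoProcessor.from_pretrained(model_id)
    model = PaliGemmaForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.bfloat16
    ).eval()

    image = Image.open(image_path).convert("RGB")
    # Assumption: the processor appends the trailing newline itself,
    # so it is stripped here to avoid sending it twice.
    inputs = processor(text=prompt.rstrip("\n"), images=image, return_tensors="pt")
    # Cast floating-point inputs (the pixel values) to match the model dtype.
    inputs = inputs.to(torch.bfloat16)
    prompt_len = inputs["input_ids"].shape[-1]

    with torch.inference_mode():
        output = model.generate(
            **inputs, max_new_tokens=max_new_tokens, do_sample=False
        )
    # Drop the prompt tokens and decode only the generated continuation.
    return processor.decode(output[0][prompt_len:], skip_special_tokens=True)
```

With a local image, a call such as `run_paligemma("cow.jpg", "caption en\n")` would return the caption text; the file name is a placeholder. The function reloads the model on every call for simplicity, so cache the model and processor if you run many prompts.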

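PaliGemma 2 mix delivers strong performance across multiple tasks, but you will get the best results by fine-tuning it on your own task or domain. See the comprehensive documentation to learn how. Official example notebooks for Keras and JAX are available, along with Hugging Face Transformers examples. We look forward to seeing what you build!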