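Gemini's native multimodal and long-context capabilities power many applications, including NotebookLM and Google Lens, and open up a range of new application possibilities for developers.
This blog post highlights some of the most compelling use cases for image and video input to the Gemini API, in the hope that they spark new ideas for using its vision capabilities. The examples here use Gemini 1.5 Pro, the most capable of our models for applications that interpret images and video, but depending on the complexity of your task we also recommend trying 1.5 Flash and 1.5 Flash-8B (or our experimental models).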
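Gemini models can describe images, answer questions about them, and reason over them. The model adapts the length, tone, and format of its descriptions to the prompt, so you can tailor its behavior to your use case.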
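Gemini can understand and process PDF documents of more than 1,000 pages. With its native vision capabilities it can accurately transcribe tables, interpret complex multi-column layouts, and understand charts, sketches, diagrams, and handwritten text within documents. It can then use that text and visual information to carry out end-to-end tasks.
The following example shows Gemini generating structured output and code from information contained in a set of PDFs.
Here we gave Gemini 15 Alphabet earnings releases, through the third quarter of 2024 (152 pages in total), and asked it to extract Google's revenue figures, organize the data into tables, and write matplotlib code to visualize revenue and quarter-on-quarter growth.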
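A request like this could be issued roughly as follows. This is a minimal sketch, not the code used for this post: it assumes the `google-generativeai` Python SDK and the model name `gemini-1.5-pro`, and the helper names `ask_about_files` and `run_with_sdk` are hypothetical.

```python
from typing import Any, Sequence


def ask_about_files(model: Any, files: Sequence[Any], prompt: str) -> str:
    """Send the uploaded files followed by a text prompt and return the text reply."""
    response = model.generate_content([*files, prompt])
    return response.text


def run_with_sdk(pdf_paths: Sequence[str], prompt: str, api_key: str) -> str:
    """Upload PDFs and query Gemini 1.5 Pro (needs network access and an API key)."""
    # Imported lazily so ask_about_files can be used and tested without the SDK.
    import google.generativeai as genai

    genai.configure(api_key=api_key)
    model = genai.GenerativeModel("gemini-1.5-pro")  # assumed model name
    files = [genai.upload_file(path=p) for p in pdf_paths]
    return ask_about_files(model, files, prompt)
```

Keeping the file parts ahead of the text prompt in a single list means the same helper works for one PDF or many.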
Prompt:
These are earnings reports over the last 15 quarters.
2. Generate a table of total revenues for each Google product/service -- "Google Search", "Youtube ads", "Google Network", "Google subscriptions, platforms and devices", "Google Cloud" and "Other Bets" -- for each of the quarters (15 in total). Show this in multiple tables (4 quarters for each table) in your output. Note that Google subscriptions, platforms and devices was called 'Google other' third quarter 2023 and before.
Response:
Here are the tables you requested:
**1. Total Revenue by Quarter (in millions USD)**
| Quarter Ended | Year | Total Revenue |
|-------------------|------|--------------|
| March 31 | 2021 | 55,314 |
| June 30 | 2021 | 61,880 |
| September 30 | 2021 | 65,118 |
| December 31 | 2021 | 75,325 |
| March 31 | 2022 | 68,011 |
| June 30 | 2022 | 69,685 |
| September 30 | 2022 | 69,092 |
| December 31 | 2022 | 76,048 |
| March 31 | 2023 | 69,787 |
| June 30 | 2023 | 74,604 |
| September 30 | 2023 | 76,693 |
| December 31 | 2023 | 86,310 |
| March 31 | 2024 | 80,539 |
| June 30 | 2024 | 84,742 |
| September 30 | 2024 | 88,268 |
**2. Revenue by Google Product/Service (in millions USD)**
**2021 Q1 - Q4**
| Quarter Ended | Google Search | YouTube Ads | Google Network | Google Other | Google Cloud | Other Bets |
|----------------|--------------|------------|---------------|-------------|-------------|-----------|
| March 31, 2021 | 31,879 | 6,005 | 6,800 | 6,494 | 4,047 | 198 |
| June 30, 2021 | 35,845 | 7,002 | 7,597 | 6,623 | 4,628 | 192 |
| Sept 30, 2021 | 37,926 | 7,205 | 7,999 | 6,754 | 4,990 | 182 |
| Dec 31, 2021 | 43,301 | 8,633 | 9,305 | 8,161 | 5,541 | 181 |
**2022 Q1 - Q4**
| Quarter Ended | Google Search | YouTube Ads | Google Network | Google Other | Google Cloud | Other Bets |
|----------------|--------------|------------|---------------|-------------|-------------|-----------|
| March 31, 2022 | 39,618 | 6,869 | 8,174 | 6,811 | 5,821 | 440 |
| June 30, 2022 | 40,689 | 7,340 | 8,259 | 6,553 | 6,276 | 193 |
| Sept 30, 2022 | 39,539 | 7,071 | 7,872 | 6,895 | 6,868 | 209 |
| Dec 31, 2022 | 42,604 | 7,963 | 8,475 | 8,796 | 7,315 | 226 |
**2023 Q1 - Q4**
| Quarter Ended | Google Search | YouTube Ads | Google Network | Google Other | Google Cloud | Other Bets |
|----------------|--------------|------------|---------------|-------------|-------------|-----------|
| March 31, 2023 | 40,359 | 6,693 | 7,496 | 7,413 | 7,454 | 288 |
| June 30, 2023 | 42,628 | 7,665 | 7,850 | 8,142 | 8,031 | 285 |
| Sept 30, 2023 | 44,026 | 7,952 | 7,669 | 8,339 | 8,411 | 297 |
| Dec 31, 2023 | 48,020 | 9,200 | 8,297 | 10,794 | 9,192 | 657 |
**2024 Q1 - Q3**
| Quarter Ended | Google Search | YouTube Ads | Google Network | Subscriptions, Platforms, & Devices | Google Cloud | Other Bets |
|----------------|--------------|------------|---------------|-----------------------------------|-------------|-----------|
| March 31, 2024 | 46,156 | 8,090 | 7,413 | 8,739 | 9,574 | 495 |
| June 30, 2024 | 48,509 | 8,663 | 7,444 | 9,312 | 10,347 | 365 |
| Sept 30, 2024 | 49,385 | 8,921 | 7,548 | 10,656 | 11,353 | 388 |
Prompt: Using the tables generated, write matplotlib code to visualize these revenues, in bar chart format. On the x-axis, plot time (each of the 15 quarters in chronological order). On the y-axis, plot revenues -- total revenue and for each product and service for each time horizon. Do not leave code for the user to fill in. Leave out Other Bets, and ensure that the bars have equal spacing.
Prompt: Awesome! Now, can you generate a plot showing quarter on quarter revenue growth for each of the product areas? The x axis should be time (each of the 15 quarters) and the y axis should be the revenue percentage growth, and there should be 6 lines on the chart. Include variable values in the code even if they were defined previously. Exclude other bets.
Quarter on Quarter revenue growth is defined as ((revenue for current quarter)-(revenue for previous quarter))/(revenue for previous quarter).
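The growth definition in the prompt can be checked by hand with a few lines of Python. This sketch is not the code Gemini generated; the function name `qoq_growth` is hypothetical, and the input is the 2021 Google Cloud column from the tables above.

```python
def qoq_growth(revenues):
    """Return quarter-on-quarter growth as fractions, one value per quarter after the first."""
    return [(cur - prev) / prev for prev, cur in zip(revenues, revenues[1:])]


# Google Cloud revenue (millions USD), Q1 to Q4 2021, taken from the tables above
google_cloud_2021 = [4047, 4628, 4990, 5541]

growth = qoq_growth(google_cloud_2021)
for pct in growth:
    print(f"{pct:.1%}")
# 14.4%
# 7.8%
# 11.0%
```

Because the first quarter has no predecessor, 15 quarters of revenue yield 14 growth values per product area.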
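With Gemini 1.5 models you can interpret and extract information from "real-world" documents such as receipts, labels, signatures, notes, whiteboard sketches, and personal records. The following example shows Gemini extracting the values of user-defined fields from an image of a receipt and returning them as a JSON object.
Gemini models can extract data from screen captures of web pages and return it in structured output formats such as JSON. Because they can take images or video of a web page and access its content in real time, the way a person would see it, they enable applications such as web data APIs and browsing agents.
In the following example, Gemini converts the information on this Google Play web page into a structured output format.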
Prompt: Extract each of the books from this webpage, and return a list of JSON objects, where each contains
Only output information that you see in the webpage
Response:
[
{
"name": "Warriors: A Starless Clan #6",
"author": "Erin Hunter",
"stars": null,
"price": "$14.95"
},
{
"name": "In Too Deep: A Reacher Novel",
"author": "Lee Child",
"stars": 4.7,
"price": "$16.95"
},
{
"name": "The Waiting: A Ballard and Bosch Novel",
"author": "Michael Connelly",
"stars": 4.8,
"price": "$16.95"
},
{
"name": "The Fury of the Gods",
"author": "John Gwynne",
"stars": 5.0,
"price": "$29.95"
},
{
"name": "Raised by Wolves: Patterson's Greatest Small-Town Thriller",
"author": "James Patterson",
"stars": null,
"price": "$14.95"
},
{
"name": "Iron Flame (1 of 2)",
"author": null,
"stars": 4.6,
"price": "$27.95"
},
{
"name": "The Message",
"author": "Ta-Nehisi Coates",
"stars": 4.9,
"price": "$14.95"
},
{
"name": "Revenge of the Tipping Point: Overstories, Superspreaders",
"author": "Malcolm Gladwell",
"stars": 4.6,
"price": "$14.95"
},
{
"name": "Counting Miracles: A Novel",
"author": "Nicholas Sparks",
"stars": 4.3,
"price": "$16.95"
},
{
"name": "Nexus: A Brief History of Information Networks from...",
"author": "Yuval Noah Harari",
"stars": 4.2,
"price": "$19.95"
},
{
"name": "Wind and Truth: Book Five of...",
"author": "Brandon Sanderson",
"stars": null,
"price": "$38.45"
},
{
"name": "Passions in Death: An Eve...",
"author": "J.D. Robb",
"stars": 4.5,
"price": "$14.95"
},
{
"name": "Sunrise on the Reaping (A Hunger Games Novel)",
"author": "Suzanne Collins",
"stars": null,
"price": "$19.95"
},
{
"name": "Michael Vey 10: The Colony",
"author": "Richard Paul Evans",
"stars": null,
"price": "$14.95"
},
{
"name": "The Echo Machine: How Right-Wing Extremism...",
"author": "David Pakman",
"stars": null,
"price": "$24.95"
},
{
"name": "The Perfect Crime (A Jessie...",
"author": null,
"stars": null,
"price": "$14.95"
},
{
"name": "Peppermint Peril (A Penny...",
"author": null,
"stars": null,
"price": "$4.99"
},
{
"name": "Lethal Prey",
"author": "John Sandford",
"stars": null,
"price": "$14.95"
},
{
"name": "The Song of Achilles: A Novel",
"author": "Madeline Miller",
"stars": 4.7,
"price": "$3.99"
},
{
"name": "A People's History of the United States",
"author": "Howard Zinn",
"stars": 4.1,
"price": "$4.99"
},
{
"name": "Termination Shock: A Novel",
"author": "Neal Stephenson",
"stars": 4.3,
"price": "$4.99"
},
{
"name": "Mere Christianity",
"author": "C.S. Lewis",
"stars": 4.8,
"price": "$4.99"
},
{
"name": "Local Woman Missing",
"author": "Mary Kubica",
"stars": 4.6,
"price": "$6.99"
},
{
"name": "Murder on the Orient Express: A Hercule Poirot Mystery",
"author": "Agatha Christie",
"stars": 4.4,
"price": "$4.99"
},
{
"name": "The Mountains Sing",
"author": "Nguyen Phan Que Mai",
"stars": 1.5,
"price": "$32.30"
},
{
"name": "The Bands of Mourning",
"author": "Brandon Sanderson",
"stars": 4.9,
"price": "$19.95"
},
{
"name": "The Institute: A Novel",
"author": "Stephen King",
"stars": 4.7,
"price": "$16.95"
},
{
"name": "Tom Lake: A Novel",
"author": "Ann Patchett",
"stars": 4.4,
"price": "$16.95"
},
{
"name": "All the Sinners Bleed: A Novel",
"author": "S.A. Cosby",
"stars": 4.7,
"price": "$14.95"
},
{
"name": "The Black Prism",
"author": "Brent Weeks",
"stars": 4.9,
"price": "$35.99"
},
{
"name": "The Hundred Years' War on Palestine: A History of Settle...",
"author": "Rashid Khalidi",
"stars": 4.6,
"price": "$14.95"
},
{
"name": "War",
"author": "Bob Woodward",
"stars": 3.7,
"price": "$14.95"
},
{
"name": "A People's History of the United States",
"author": "Howard Zinn",
"stars": 4.1,
"price": "$4.99"
},
{
"name": "Confronting the Presidents: No Spin Assessments from...",
"author": "Bill O'Reilly & Martin Dugard",
"stars": 4.8,
"price": "$19.95"
},
{
"name": "Chaos: Charles Manson, the CIA, and the Secret History o...",
"author": null,
"stars": 4.8,
"price": "$29.95"
},
{
"name": "The Demon of Unrest: A Saga of Hubris, Heartbreak, and...",
"author": "Erik Larson",
"stars": 3.1,
"price": "$19.95"
},
{
"name": "A Hidden Secret: A Kate Burkholder Short Story",
"author": "Linda Castillo",
"stars": 4.8,
"price": "$1.99"
},
{
"name": "MrBallen Presents: Strange, Dark & Mysterious: The...",
"author": null,
"stars": 5.0,
"price": "$13.95"
},
{
"name": "Angle of Investigation: Three Harry Bosch Stories",
"author": "Michael Connelly",
"stars": 4.4,
"price": "$12.95"
},
{
"name": "Cross Kill: An Alex Cross Story",
"author": "James Patterson",
"stars": 4.7,
"price": "$14.95"
},
{
"name": "Detective Cross",
"author": "James Patterson",
"stars": 4.6,
"price": "$14.95"
},
{
"name": "Blood Brothers: A Short Story Exclusive",
"author": "James Rollins",
"stars": 5.0,
"price": "$9.95"
}
]
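Output like this is easy to post-process downstream. The sketch below is illustrative only: the helper names `parse_books` and `find_duplicate_names` and the `price_usd` field are hypothetical, and the sample reuses three entries from the response above. It parses the JSON, converts price strings to numbers, and flags repeated titles and missing authors.

```python
import json


def parse_books(raw: str):
    """Parse the model's JSON text and add a numeric price_usd field to each book."""
    books = json.loads(raw)
    for book in books:
        # Prices arrive as strings such as "$14.95"
        book["price_usd"] = float(book["price"].lstrip("$"))
    return books


def find_duplicate_names(books):
    """Return titles that appear more than once, in first-repeat order."""
    seen, dupes = set(), []
    for book in books:
        name = book["name"]
        if name in seen and name not in dupes:
            dupes.append(name)
        seen.add(name)
    return dupes


sample = """[
  {"name": "A People's History of the United States", "author": "Howard Zinn", "stars": 4.1, "price": "$4.99"},
  {"name": "Iron Flame (1 of 2)", "author": null, "stars": 4.6, "price": "$27.95"},
  {"name": "A People's History of the United States", "author": "Howard Zinn", "stars": 4.1, "price": "$4.99"}
]
"""

books = parse_books(sample)
missing_author = [b["name"] for b in books if b["author"] is None]
print(find_duplicate_names(books))
print(missing_author)
```

A `null` in the model output becomes `None` after parsing, which makes it straightforward to separate "not shown on the page" from an empty string.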
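You can use Gemini to detect objects in an image and generate bounding box coordinates for them. Gemini's object detection is especially useful when you want to ground the model's responses in the image. When objects need to be reasoned about and located based on user-defined criteria, it can offer added value over specialized models.
The following is an example of object detection with Gemini 1.5 Pro.
Gemini detects each object in the image and outputs bounding box coordinates in its response. The image below shows those bounding boxes (drawn using this Colab notebook).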
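To draw boxes on an image, the returned coordinates have to be mapped to pixels. The sketch below assumes the model returns each box as `[ymin, xmin, ymax, xmax]` on a 0 to 1000 scale, which is the convention described in the Gemini API documentation but is not stated in this post. The function name `to_pixel_box` is hypothetical.

```python
def to_pixel_box(box, image_width, image_height, scale=1000):
    """Convert [ymin, xmin, ymax, xmax] on a 0..scale grid to (left, top, right, bottom) pixels."""
    ymin, xmin, ymax, xmax = box
    left = round(xmin * image_width / scale)
    top = round(ymin * image_height / scale)
    right = round(xmax * image_width / scale)
    bottom = round(ymax * image_height / scale)
    return left, top, right, bottom


# Example: a box covering the middle of a 1280x720 image (values are made up)
print(to_pixel_box([250, 100, 750, 500], image_width=1280, image_height=720))
# (128, 180, 640, 540)
```

Note the order: the y coordinates come first in the assumed model output, while most drawing libraries expect x first.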
Gemini can process both the visual frames and the audio of videos up to 90 minutes long. With Gemini, you can do things like the following.
The following is an example of using Gemini to generate a summary of a video.
Prompt: Generate technical lecture notes for this video in a manner that would be understandable for a high school student. Make sure to follow these instructions:
Response:
Here are the technical lecture notes you requested:
# **Achieving Rapid Response Times in Large Online Services**
Jeff Dean, Google
## **Introduction**
Rapid response times of web applications are important for making them more interactive, fluid, and easy to use for the user. It’s often challenging to keep web apps performing quickly when there’s a large fanout, or the number of servers that need to be contacted in order to fulfill the user’s request. This can be seen in Google Search, where the results page often requires information from thousands of servers to create. This is made even more challenging when Google services are run in a shared environment, or on a large cluster of servers where many different services can be performed. This allows for an array of network issues like traffic congestion, background activity, and spikes in foreground activity for other services being carried out on the same server cluster.
When the latency of applications under these conditions is inspected, they exhibit what's referred to as long-tail latencies. Jeff used an amusing picture of himself on an African safari getting his shirt pulled by a cheetah to illustrate this concept. As he explained, long tail latency means that when you measure the latency of your application and find that it performs quickly on average, the 99th percentile latency could be very long. For example, if your server has a 10 ms average response time and a 99th percentile latency of 1 second, then if you have to get data from only one of those servers, 1% of requests will take more than a second. But when you have to get a response from 100 servers, 63% of your requests will take over a second, since at least one of the servers is likely to be experiencing this high latency.
## **Basic Latency Reduction Techniques**
Some common ways of dealing with these issues are:
* Differentiated service classes: Prioritizing interactive requests and their network traffic much higher than background requests, as this is less likely to affect the user experience if these requests lag.
* Reduce head-of-line blocking: Dividing large requests into many smaller requests to prevent lag for higher-priority requests waiting behind them.
* Manage expensive background activities: Rate-limiting background activities or delaying them until there’s less traffic on the servers, as these are usually not directly associated with a user request.
## **Fault Tolerance vs. Tolerating Variability**
The speaker drew an analogy to fault tolerance, a common hardware technique where unreliable parts of the system, like hard drives or a computer's power supply, are used to create a whole reliable system. By analogy, he wants to use unpredictable components that vary greatly in performance to create a predictable and high-performing system. Jeff pointed out that while both fault tolerance and tolerating variability use extra resources, the difference between the two is in the timescale of their variability. The issues that fault tolerance measures are on a scale of tens or hundreds of events per day, while latency tolerance measures thousands of events per second.
## **Latency Tolerating Techniques**
Here are two techniques that Jeff describes for minimizing variability in latency:
### Cross Request Adaptation
* Collect statistics on the system. This could include latency rates, performance of backends, etc.
* Take action to improve the latency of future requests, for example, by load balancing.
* Timescale for these kinds of actions are generally on the order of tens of seconds to minutes.
### Within-Request Adaptation
* Within a single high-level request, cope with slow subsystems.
* Timescale for these kinds of actions are generally immediate, while the user is waiting for a request to be fulfilled.
## **Fine-Grained Dynamic Partitioning**
One cross request adaption technique that Jeff discussed was fine-grained dynamic partitioning. Normally, if you have ‘n’ servers, you could simply divide the workload into ‘n’ equal pieces, and each server can deal with one piece each, assuming that there’s no shared environment where other things can happen. But once you have a shared environment, the load becomes unpredictable and can result in a server getting overloaded. In the case of a shared environment, it’s recommended to have a server dealing with 10–100 different pieces of work. This allows for very fine-grain load balancing, because if one server is overloaded, one of those pieces of work can be assigned to another server. Another reason for doing this is that it speeds up failure recovery, because when a server dies, whatever it was responsible for is distributed to other machines, and if the workload has been divided into ‘n’ smaller tasks, this recovery process can happen in ‘n’ separate ways simultaneously.
## **Selective Replication**
Another technique often used by Google is called selective replication, where heavily-used pieces of information in the system are copied to other server clusters. This can be static, where the number of copies is fixed, or dynamic, where the number of copies of a piece of information is increased or decreased depending on the amount of traffic there is in requests associated with that information.
## **Latency-Induced Probation**
A third technique that Jeff described for dealing with unpredictable latency and interference effects from shared services was what he calls latency-induced probation, or the concept of removing capacity under load to improve latency. The steps for this are:
* Recognize that a server is slow to respond, even if it is a high priority server.
* Make a copy of the data in question on another server.
* Send a “shadow stream” of requests to the slow server. These requests are similar to “canary requests” in that they serve as a check to make sure the server is functioning.
* Once the latency of the slow server has gone down and the “canary” checks show it working, return it to service.
## **Backup Requests**
Another technique for minimizing latency variability is the use of backup requests, where a client sends a copy of the same request to two or more server clusters in order to improve latency. If one of the servers selected returns the data faster, the client sends a cancellation request for the duplicate request in the other server queue, if it’s possible to maintain information about where the original request was sent. However, the disadvantage of this is that it can double the processing required if two servers begin processing the request at about the same time. In the latter case, the client needs to check if the issue of simultaneous processing occurred, and if so, to send only one copy of the requested data.
The speaker then measured the improvement in latency using two different systems. The first was a loaded server cluster where data was replicated in two in-memory servers, and 1000 requests were spread across 100 tablets. The speaker measured the time it took for all 1000 keys to be retrieved. The second measurement used an almost completely idle system, but with the same parameters: data was replicated in two in-memory servers, 1000 requests were sent across 100 tablets, and the total retrieval time for all 1000 keys was measured. In both cases, backup requests reduced latency dramatically. The results for both loaded and idle servers, respectively, are shown in the tables below:
### Loaded cluster results:
| Policy | Avg | Std Dev | 95%ile | 99%ile | 99.9%ile |
| ------------- |:--------:|:-------:|:------:|:------:|:--------:|
| No backups | 33 ms | 1524 ms| 24 ms | 52 ms | 994 ms |
| Backup after 10 ms | 14 ms | 4 ms | 20 ms | 23 ms | 50 ms |
| Backup after 50 ms | 16 ms | 12 ms | 57 ms | 63 ms | 68 ms |
### Idle cluster results:
| Policy | 50%ile | 90%ile | 99%ile | 99.9%ile |
| ------------- |:--------:|:-------:|:------:|:--------:|
| No backups | 19 ms | 38 ms | 67 ms | 98 ms |
| Backup after 2 ms | 16 ms | 28 ms | 38 ms | 51 ms |
## Conclusion
These techniques can make online services more responsive and can dramatically cut down on processing time and costs.
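The fan-out figure in the lecture notes above (63% of requests exceeding one second when 100 servers are contacted) can be reproduced with a short calculation. This sketch assumes each server is slow independently with probability 1%. The function name `p_any_slow` is hypothetical.

```python
def p_any_slow(p_slow_single: float, fanout: int) -> float:
    """Probability that at least one of `fanout` independent servers is slow."""
    return 1 - (1 - p_slow_single) ** fanout


print(f"{p_any_slow(0.01, 1):.1%}")    # 1.0%
print(f"{p_any_slow(0.01, 100):.1%}")  # 63.4%
```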
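Gemini can extract information from video and output it in structured formats such as lists, tables, and JSON objects. This is especially useful for tasks such as cataloging, entity detection in domains like retail, traffic, and home security, and extracting unstructured data from screen recordings.
Note: Because video is sampled at 1 FPS, the model may miss items in a video. We are currently working on increasing the video sampling rate. For now, we recommend verifying the output for your use case as needed, and we plan to share more about the model's future direction in the coming months.
To start building with the Gemini API's vision capabilities, see the developer guide. You can also join the developer forum to connect with other developers, discuss use cases, and get advice from members of the Gemini API team.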
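To get a feel for what 1 FPS sampling means, the sketch below estimates how many frames are sampled from a video and whether a short on-screen event could fall entirely between two samples. It assumes evenly spaced samples and is only a rough model, not a description of how the API samples internally. Both function names are hypothetical.

```python
def approx_sampled_frames(duration_seconds: float, fps: float = 1.0) -> int:
    """Approximate number of frames sampled from a video at the given rate."""
    return int(duration_seconds * fps)


def may_be_missed(event_seconds: float, fps: float = 1.0) -> bool:
    """True if an event is shorter than the gap between samples and so can fall between them."""
    return event_seconds < 1.0 / fps


print(approx_sampled_frames(90 * 60))  # 5400
print(may_be_missed(0.4))              # True
print(may_be_missed(2.0))              # False
```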