7 examples of Gemini's multimodal capabilities in action

November 25, 2024
Anirudh Baddepudi, Product Manager, Gemini Vision, Google DeepMind
Logan Kilpatrick, Senior Product Manager, Gemini API and Google AI Studio

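Exploring Gemini's native image and video understanding through real-world examples

Gemini's native multimodal and long-context capabilities power many applications, including NotebookLM and Google Lens, and have unlocked a range of novel applications for developers.

This post highlights some of the use cases we are most excited about for image and video input in the Gemini API, and we hope they spark new ideas for building with vision. The examples shown use Gemini 1.5 Pro, our most capable model for image and video understanding, but we also encourage you to try 1.5 Flash and 1.5 Flash-8B depending on the complexity of the task (and our experimental models too!).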


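# 1. Detailed image description

Gemini models can describe images, answer questions about them, and reason over them. They adapt the length, tone, and format of a description to the prompt, so you can tailor the model's behavior to your use case.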
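As a rough illustration of steering a description through the prompt, the sketch below builds a prompt from a desired length, tone, and format, then sends it together with an image. It assumes the `google-generativeai` Python SDK and Pillow are installed and that an API key is available in a `GOOGLE_API_KEY` environment variable. `build_description_prompt` and `describe_image` are hypothetical helper names introduced here, not part of the API.

```python
import os


def build_description_prompt(length: str, tone: str, fmt: str) -> str:
    """Compose a prompt that states the desired length, tone, and format."""
    return (
        f"Describe this image in {length}. "
        f"Use a {tone} tone and format the answer as {fmt}."
    )


def describe_image(image_path: str, prompt: str) -> str:
    """Send the prompt and image to Gemini 1.5 Pro and return the text reply."""
    # Imported lazily so the prompt helper works without the SDK installed.
    import google.generativeai as genai
    from PIL import Image

    # Assumption: the key is stored in the GOOGLE_API_KEY environment variable.
    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
    model = genai.GenerativeModel("gemini-1.5-pro")
    response = model.generate_content([prompt, Image.open(image_path)])
    return response.text


prompt = build_description_prompt("two sentences", "playful", "a bulleted list")
print(prompt)
```

Calling `describe_image("photo.jpg", prompt)` with a different length, tone, or format is one way to compare how the description changes; the file name is a placeholder.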

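# 2. Understanding long PDFs with native vision

Gemini can understand and process PDFs of more than 1,000 pages. With its native vision capabilities, Gemini can accurately transcribe tables, interpret complex multi-column layouts, understand charts, sketches, diagrams, and handwritten text within documents, and use this textual and visual information to carry out end-to-end tasks.

In the following example, Gemini uses information from a series of PDFs to generate structured output and code.

We provided 15 Alphabet earnings reports through Q3 2024 (152 pages in total), then asked Gemini to extract Google's revenue figures from them, build tables summarizing the data, and write matplotlib code to visualize revenue and quarter-over-quarter growth.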

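Prompt

Here are the revenue reports for the past 15 quarters.

  1. Generate a Markdown table showing total revenue for each quarter from "Quarter ended March 31, 2021" through "Quarter ended September 30, 2024". There should be 15 quarters in total.
  2. For each Google product or service ("Google Search", "YouTube ads", "Google Network", "Google subscriptions, platforms, and devices", "Google Cloud", and "Other Bets"), generate a table of revenue by quarter, covering all 15 quarters. Present the output as multiple tables (each showing 4 quarters of data). Note that through Q3 2023, "Google subscriptions, platforms, and devices" was called "Google other".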


Response

Here are the tables you requested:

**1. Total revenue by quarter (in millions of USD)**

| Quarter ended | Year | Total revenue |
|---------------|------|---------------|
| March 31      | 2021 | 55,314        |
| June 30       | 2021 | 61,880        |
| September 30  | 2021 | 65,118        |
| December 31   | 2021 | 75,325        |
| March 31      | 2022 | 68,011        |
| June 30       | 2022 | 69,685        |
| September 30  | 2022 | 69,092        |
| December 31   | 2022 | 76,048        |
| March 31      | 2023 | 69,787        |
| June 30       | 2023 | 74,604        |
| September 30  | 2023 | 76,693        |
| December 31   | 2023 | 86,310        |
| March 31      | 2024 | 80,539        |
| June 30       | 2024 | 84,742        |
| September 30  | 2024 | 88,268        |
 
 
 
**2. Revenue by Google product or service (in millions of USD)**

**Q1 to Q4 2021**

| Quarter ended      | Google Search | YouTube ads | Google Network | Google other | Google Cloud | Other Bets |
|--------------------|---------------|-------------|----------------|--------------|--------------|------------|
| March 31, 2021     | 31,879        | 6,005       | 6,800          | 6,494        | 4,047        | 198        |
| June 30, 2021      | 35,845        | 7,002       | 7,597          | 6,623        | 4,628        | 192        |
| September 30, 2021 | 37,926        | 7,205       | 7,999          | 6,754        | 4,990        | 182        |
| December 31, 2021  | 43,301        | 8,633       | 9,305          | 8,161        | 5,541        | 181        |

**Q1 to Q4 2022**

| Quarter ended      | Google Search | YouTube ads | Google Network | Google other | Google Cloud | Other Bets |
|--------------------|---------------|-------------|----------------|--------------|--------------|------------|
| March 31, 2022     | 39,618        | 6,869       | 8,174          | 6,811        | 5,821        | 440        |
| June 30, 2022      | 40,689        | 7,340       | 8,259          | 6,553        | 6,276        | 193        |
| September 30, 2022 | 39,539        | 7,071       | 7,872          | 6,895        | 6,868        | 209        |
| December 31, 2022  | 42,604        | 7,963       | 8,475          | 8,796        | 7,315        | 226        |

**Q1 to Q4 2023**

| Quarter ended      | Google Search | YouTube ads | Google Network | Google other | Google Cloud | Other Bets |
|--------------------|---------------|-------------|----------------|--------------|--------------|------------|
| March 31, 2023     | 40,359        | 6,693       | 7,496          | 7,413        | 7,454        | 288        |
| June 30, 2023      | 42,628        | 7,665       | 7,850          | 8,142        | 8,031        | 285        |
| September 30, 2023 | 44,026        | 7,952       | 7,669          | 8,339        | 8,411        | 297        |
| December 31, 2023  | 48,020        | 9,200       | 8,297          | 10,794       | 9,192        | 657        |

**Q1 to Q3 2024**

| Quarter ended      | Google Search | YouTube ads | Google Network | Subscriptions, platforms, and devices | Google Cloud | Other Bets |
|--------------------|---------------|-------------|----------------|---------------------------------------|--------------|------------|
| March 31, 2024     | 46,156        | 8,090       | 7,413          | 8,739                                 | 9,574        | 495        |
| June 30, 2024      | 48,509        | 8,663       | 7,444          | 9,312                                 | 10,347       | 365        |
| September 30, 2024 | 49,385        | 8,921       | 7,548          | 10,656                                | 11,353       | 388        |

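Prompt: Using the generated tables, write Matplotlib code to visualize this revenue as a bar chart. Plot time on the x-axis (the 15 quarters in chronological order), and on the y-axis plot total revenue and the revenue of each product and service for each period. Do not leave any parts of the code for the user to fill in. Exclude "Other Bets", and make sure the bars are evenly spaced.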

Rendered chart using the model’s code output

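Prompt: Great! Now, can you generate a chart showing quarter-over-quarter revenue growth for each product area? The x-axis should represent time (each of the 15 quarters) and the y-axis should represent percentage revenue growth; the chart should also contain 6 lines. Keep variable values in the code even if they were already defined earlier. Exclude "Other Bets".

Quarter-over-quarter revenue growth is defined as: (current quarter revenue - previous quarter revenue) / previous quarter revenue.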
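The growth definition above can be checked by hand with a few lines of Python. This is a minimal sketch, not the code the model produced: it applies the formula to the total revenue column from the first table in the response, and `qoq_growth` is a hypothetical helper name. Fifteen quarters yield fourteen growth values, because the first quarter has no predecessor.

```python
# Total revenue per quarter (millions of USD), Q1 2021 through Q3 2024,
# copied from the first table in the model's response.
total_revenue = [
    55314, 61880, 65118, 75325,
    68011, 69685, 69092, 76048,
    69787, 74604, 76693, 86310,
    80539, 84742, 88268,
]


def qoq_growth(values):
    """Return (current - previous) / previous for each consecutive pair."""
    return [(cur - prev) / prev for prev, cur in zip(values, values[1:])]


growth = qoq_growth(total_revenue)

# Print each growth rate as a signed percentage.
for rate in growth:
    print(f"{rate:+.1%}")
```

The same helper can be applied to each product column to produce the remaining lines of the chart.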

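# 3. Reasoning across real-world documents

You can use Gemini 1.5 models to understand and extract information from all kinds of real-world documents, such as receipts, labels, signs, notes, whiteboard sketches, and personal records. Below is an example of a Gemini model extracting the values of user-defined fields from an image of a receipt and returning them as a JSON object.

Prompt in Google AI Studio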
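When a model returns user-defined fields as JSON, it is usually worth validating the object before using it downstream. The sketch below shows one way to do that with the standard library. The field names (`merchant`, `date`, `total`) and the sample response are hypothetical; the post does not list the fields used in its receipt example.

```python
import json

# Hypothetical user-defined fields and the JSON types expected for each.
REQUIRED_FIELDS = {
    "merchant": str,
    "date": str,
    "total": (int, float),
}


def parse_receipt(response_text: str) -> dict:
    """Parse a model response that should be a JSON object and check its fields."""
    data = json.loads(response_text)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in data:
            raise ValueError(f"missing field: {name}")
        value = data[name]
        # Allow null for values the model could not read from the receipt.
        if value is not None and not isinstance(value, expected_type):
            raise ValueError(f"unexpected type for field: {name}")
    return data


# Made-up response text, used only to exercise the parser.
sample_response = '{"merchant": "Example Cafe", "date": "2024-11-25", "total": 12.5}'
receipt = parse_receipt(sample_response)
print(receipt["total"])
```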

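# 4. Webpage data extraction

Gemini models can extract data from screenshots of webpages and return it in structured output formats such as JSON. This lets them access page content in real time much as a person sees it (including the images and videos on the page), enabling applications such as web data APIs and browsing agents.

Below is an example of Gemini converting the information on this Google Play page into structured output.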

example of Gemini extracting contents

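Prompt: Extract the information for each book on this webpage and return a list of JSON objects, where each object contains

  1. Book name
  2. Author
  3. Star rating
  4. Price

Only output information that you see on the webpage.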

[
    {
        "name": "Warriors: A Starless Clan #6",
        "author": "Erin Hunter",
        "stars": null,
        "price": "$14.95"
    },
    {
        "name": "In Too Deep: A Reacher Novel",
        "author": "Lee Child",
        "stars": 4.7,
        "price": "$16.95"
    },
    {
        "name": "The Waiting: A Ballard and Bosch Novel",
        "author": "Michael Connelly",
        "stars": 4.8,
        "price": "$16.95"
    },
    {
        "name": "The Fury of the Gods",
        "author": "John Gwynne",
        "stars": 5.0,
        "price": "$29.95"
    },
    {
        "name": "Raised by Wolves: Patterson's Greatest Small-Town Thriller",
        "author": "James Patterson",
        "stars": null,
        "price": "$14.95"
    },
    {
        "name": "Iron Flame (1 of 2)",
        "author": null,
        "stars": 4.6,
        "price": "$27.95"
    },
    {
        "name": "The Message",
        "author": "Ta-Nehisi Coates",
        "stars": 4.9,
        "price": "$14.95"
    },
    {
        "name": "Revenge of the Tipping Point: Overstories, Superspreaders",
        "author": "Malcolm Gladwell",
        "stars": 4.6,
        "price": "$14.95"
    },
    {
        "name": "Counting Miracles: A Novel",
        "author": "Nicholas Sparks",
        "stars": 4.3,
        "price": "$16.95"
    },
    {
        "name": "Nexus: A Brief History of Information Networks from...",
        "author": "Yuval Noah Harari",
        "stars": 4.2,
        "price": "$19.95"
    },
    {
        "name": "Wind and Truth: Book Five of...",
        "author": "Brandon Sanderson",
        "stars": null,
        "price": "$38.45"
    },
    {
        "name": "Passions in Death: An Eve...",
        "author": "J.D. Robb",
        "stars": 4.5,
        "price": "$14.95"
    },
    {
        "name": "Sunrise on the Reaping (A Hunger Games Novel)",
        "author": "Suzanne Collins",
        "stars": null,
        "price": "$19.95"
    },
    {
        "name": "Michael Vey 10: The Colony",
        "author": "Richard Paul Evans",
        "stars": null,
        "price": "$14.95"
    },
    {
        "name": "The Echo Machine: How Right-Wing Extremism...",
        "author": "David Pakman",
        "stars": null,
        "price": "$24.95"
    },
    {
        "name": "The Perfect Crime (A Jessie...",
        "author": null,
        "stars": null,
        "price": "$14.95"
    },
    {
        "name": "Peppermint Peril (A Penny...",
        "author": null,
        "stars": null,
        "price": "$4.99"
    },
    {
        "name": "Lethal Prey",
        "author": "John Sandford",
        "stars": null,
        "price": "$14.95"
    },
    {
        "name": "The Song of Achilles: A Novel",
        "author": "Madeline Miller",
        "stars": 4.7,
        "price": "$3.99"
    },
    {
        "name": "A People's History of the United States",
        "author": "Howard Zinn",
        "stars": 4.1,
        "price": "$4.99"
    },
    {
        "name": "Termination Shock: A Novel",
        "author": "Neal Stephenson",
        "stars": 4.3,
        "price": "$4.99"
    },
    {
        "name": "Mere Christianity",
        "author": "C.S. Lewis",
        "stars": 4.8,
        "price": "$4.99"
    },
    {
        "name": "Local Woman Missing",
        "author": "Mary Kubica",
        "stars": 4.6,
        "price": "$6.99"
    },
    {
        "name": "Murder on the Orient Express: A Hercule Poirot Mystery",
        "author": "Agatha Christie",
        "stars": 4.4,
        "price": "$4.99"
    },
    {
        "name": "The Mountains Sing",
        "author": "Nguyen Phan Que Mai",
        "stars": 1.5,
        "price": "$32.30"
    },
    {
        "name": "The Bands of Mourning",
        "author": "Brandon Sanderson",
        "stars": 4.9,
        "price": "$19.95"
    },
    {
        "name": "The Institute: A Novel",
        "author": "Stephen King",
        "stars": 4.7,
        "price": "$16.95"
    },
    {
        "name": "Tom Lake: A Novel",
        "author": "Ann Patchett",
        "stars": 4.4,
        "price": "$16.95"
    },
    {
        "name": "All the Sinners Bleed: A Novel",
        "author": "S.A. Cosby",
        "stars": 4.7,
        "price": "$14.95"
    },
    {
        "name": "The Black Prism",
        "author": "Brent Weeks",
        "stars": 4.9,
        "price": "$35.99"
    },
    {
        "name": "The Hundred Years' War on Palestine: A History of Settle...",
        "author": "Rashid Khalidi",
        "stars": 4.6,
        "price": "$14.95"
    },
    {
        "name": "War",
        "author": "Bob Woodward",
        "stars": 3.7,
        "price": "$14.95"
    },
    {
        "name": "A People's History of the United States",
        "author": "Howard Zinn",
        "stars": 4.1,
        "price": "$4.99"
    },
    {
        "name": "Confronting the Presidents: No Spin Assessments from...",
        "author": "Bill O'Reilly & Martin Dugard",
        "stars": 4.8,
        "price": "$19.95"
    },
    {
        "name": "Chaos: Charles Manson, the CIA, and the Secret History o...",
        "author": null,
        "stars": 4.8,
        "price": "$29.95"
    },
    {
        "name": "The Demon of Unrest: A Saga of Hubris, Heartbreak, and...",
        "author": "Erik Larson",
        "stars": 3.1,
        "price": "$19.95"
    },
    {
        "name": "A Hidden Secret: A Kate Burkholder Short Story",
        "author": "Linda Castillo",
        "stars": 4.8,
        "price": "$1.99"
    },
    {
        "name": "MrBallen Presents: Strange, Dark & Mysterious: The...",
        "author": null,
        "stars": 5.0,
        "price": "$13.95"
    },
    {
        "name": "Angle of Investigation: Three Harry Bosch Stories",
        "author": "Michael Connelly",
        "stars": 4.4,
        "price": "$12.95"
    },
    {
        "name": "Cross Kill: An Alex Cross Story",
        "author": "James Patterson",
        "stars": 4.7,
        "price": "$14.95"
    },
    {
        "name": "Detective Cross",
        "author": "James Patterson",
        "stars": 4.6,
        "price": "$14.95"
    },
    {
        "name": "Blood Brothers: A Short Story Exclusive",
        "author": "James Rollins",
        "stars": 5.0,
        "price": "$9.95"
    }
 
]

Prompt in Google AI Studio
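Output like the list above still needs some care before it is usable: several entries have a `null` author or rating, prices arrive as strings such as `"$16.95"`, and one title ("A People's History of the United States") appears twice. The following is a minimal sketch of loading such a response into typed records; the `Book` class and both helper functions are hypothetical names, and the sample uses three entries copied from the output.

```python
import json
from dataclasses import dataclass
from decimal import Decimal
from typing import List, Optional


@dataclass(frozen=True)
class Book:
    name: str
    author: Optional[str]   # None when the model could not see an author
    stars: Optional[float]  # None when no rating was visible
    price: Decimal


def parse_books(response_text: str) -> List[Book]:
    """Convert the model's JSON list into Book records."""
    return [
        Book(
            name=item["name"],
            author=item.get("author"),
            stars=item.get("stars"),
            price=Decimal(item["price"].lstrip("$")),
        )
        for item in json.loads(response_text)
    ]


def deduplicate(books: List[Book]) -> List[Book]:
    """Drop repeated (name, author) pairs, keeping the first occurrence."""
    seen = set()
    unique = []
    for book in books:
        key = (book.name, book.author)
        if key not in seen:
            seen.add(key)
            unique.append(book)
    return unique


sample = """[
  {"name": "Iron Flame (1 of 2)", "author": null, "stars": 4.6, "price": "$27.95"},
  {"name": "A People's History of the United States", "author": "Howard Zinn", "stars": 4.1, "price": "$4.99"},
  {"name": "A People's History of the United States", "author": "Howard Zinn", "stars": 4.1, "price": "$4.99"}
]"""

books = deduplicate(parse_books(sample))
missing_author = [book.name for book in books if book.author is None]
```

Using `Decimal` rather than `float` for prices avoids rounding surprises if the values are later summed or compared.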


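# 5. Object detection

You can use Gemini to detect objects in an image and generate bounding box coordinates for them. Gemini's object detection is particularly useful for visually grounding the model's responses in the image, and it can be more valuable than specialized models when you need to reason about and locate objects according to user-defined criteria.

Below is an example of object detection with Gemini 1.5 Pro:

Gemini detects the objects in the image and outputs bounding box coordinates in its response. The image below shows these bounding boxes (rendered with a Colab notebook).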
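To draw boxes like these, the coordinates in the response have to be mapped onto the image's pixel grid. The sketch below assumes the convention described in the Gemini API documentation for 1.5 models, where each box is `[ymin, xmin, ymax, xmax]` with values normalized to a 0 to 1000 range; this post does not state the format, so treat that as an assumption and check it against the response you actually receive. `to_pixel_box` is a hypothetical helper name.

```python
from typing import Sequence, Tuple


def to_pixel_box(
    box: Sequence[float], width: int, height: int
) -> Tuple[int, int, int, int]:
    """Convert a normalized [ymin, xmin, ymax, xmax] box to pixel coordinates.

    Assumes coordinates are normalized to 0..1000 (see the lead-in).
    Returns (left, top, right, bottom), the order most drawing APIs expect.
    """
    ymin, xmin, ymax, xmax = box
    left = round(xmin * width / 1000)
    top = round(ymin * height / 1000)
    right = round(xmax * width / 1000)
    bottom = round(ymax * height / 1000)
    return left, top, right, bottom


# A made-up box on an 800x600 image.
pixel_box = to_pixel_box([100, 250, 500, 750], width=800, height=600)
print(pixel_box)
```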

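# 6. Video summarization and transcription

Gemini can process videos of up to 90 minutes, including both the video frames and the audio. With Gemini, you can:

  1. Generate transcripts and summaries of videos
  2. Extract structured data from footage of everyday scenes
  3. Ask questions about the information in a video
  4. Identify and retrieve key moments in a video

Below is an example of using Gemini to summarize a video:

Prompt: Generate technical lecture notes for this video that a high school student could understand. Be sure to follow these instructions:

  1. Write the notes using both the audio and the information shown on each slide. Clearly indicate when you use information from a slide, and pay attention to and describe any charts or diagrams shown on the slides.
  2. Include clear sections for each part of the lecture.
  3. Cover the full content of the video from start to finish.

Response: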

Here are the technical lecture notes you requested:
 
# **Achieving Rapid Response Times in Large Online Services**
Jeff Dean, Google
 
## **Introduction**
Rapid response times of web applications are important for making them more interactive, fluid, and easy to use for the user. It’s often challenging to keep web apps performing quickly when there’s a large fanout, or the number of servers that need to be contacted in order to fulfill the user’s request. This can be seen in Google Search, where the results page often requires information from thousands of servers to create. This is made even more challenging when Google services are run in a shared environment, or on a large cluster of servers where many different services can be performed. This allows for an array of network issues like traffic congestion, background activity, and spikes in foreground activity for other services being carried out on the same server cluster. 
 
When the latency of applications under these conditions is inspected, they exhibit what's referred to as long-tail latencies. Jeff used an amusing picture of himself on an African safari getting his shirt pulled by a cheetah to illustrate this concept. As he explained, long tail latency means that when you measure the latency of your application and find that it performs quickly on average, the 99th percentile latency could be very long. For example, if your server has a 10 ms average response time and a 99th percentile latency of 1 second, then if you have to get data from only one of those servers, 1% of requests will take more than a second. But when you have to get a response from 100 servers, 63% of your requests will take over a second, since at least one of the servers is likely to be experiencing this high latency.
 
## **Basic Latency Reduction Techniques**
 
Some common ways of dealing with these issues are:
 
* Differentiated service classes: Prioritizing interactive requests and their network traffic much higher than background requests, as this is less likely to affect the user experience if these requests lag.
* Reduce head-of-line blocking: Dividing large requests into many smaller requests to prevent lag for higher-priority requests waiting behind them. 
* Manage expensive background activities: Rate-limiting background activities or delaying them until there’s less traffic on the servers, as these are usually not directly associated with a user request.
 
## **Fault Tolerance vs. Tolerating Variability**
 
The speaker drew an analogy to fault tolerance, a common hardware technique where unreliable parts of the system, like hard drives or a computer's power supply, are used to create a whole reliable system. By analogy, he wants to use unpredictable components that vary greatly in performance to create a predictable and high-performing system. Jeff pointed out that while both fault tolerance and tolerating variability use extra resources, the difference between the two is in the timescale of their variability. The issues that fault tolerance measures are on a scale of tens or hundreds of events per day, while latency tolerance measures thousands of events per second.  
 
## **Latency Tolerating Techniques**
 
Here are two techniques that Jeff describes for minimizing variability in latency:
 
### Cross Request Adaptation
 
* Collect statistics on the system. This could include latency rates, performance of backends, etc.
* Take action to improve the latency of future requests, for example, by load balancing.
* Timescale for these kinds of actions are generally on the order of tens of seconds to minutes. 
 
### Within-Request Adaptation
 
* Within a single high-level request, cope with slow subsystems.
* Timescale for these kinds of actions are generally immediate, while the user is waiting for a request to be fulfilled. 
 
## **Fine-Grained Dynamic Partitioning**
 
One cross request adaption technique that Jeff discussed was fine-grained dynamic partitioning. Normally, if you have ‘n’ servers, you could simply divide the workload into ‘n’ equal pieces, and each server can deal with one piece each, assuming that there’s no shared environment where other things can happen. But once you have a shared environment, the load becomes unpredictable and can result in a server getting overloaded. In the case of a shared environment, it’s recommended to have a server dealing with 10–100 different pieces of work. This allows for very fine-grain load balancing, because if one server is overloaded, one of those pieces of work can be assigned to another server. Another reason for doing this is that it speeds up failure recovery, because when a server dies, whatever it was responsible for is distributed to other machines, and if the workload has been divided into ‘n’ smaller tasks, this recovery process can happen in ‘n’ separate ways simultaneously.
 
## **Selective Replication**
 
Another technique often used by Google is called selective replication, where heavily-used pieces of information in the system are copied to other server clusters. This can be static, where the number of copies is fixed, or dynamic, where the number of copies of a piece of information is increased or decreased depending on the amount of traffic there is in requests associated with that information.
 
## **Latency-Induced Probation**
 
A third technique that Jeff described for dealing with unpredictable latency and interference effects from shared services was what he calls latency-induced probation, or the concept of removing capacity under load to improve latency. The steps for this are:
 
* Recognize that a server is slow to respond, even if it is a high priority server.
* Make a copy of the data in question on another server.
* Send a “shadow stream” of requests to the slow server. These requests are similar to “canary requests” in that they serve as a check to make sure the server is functioning. 
* Once the latency of the slow server has gone down and the “canary” checks show it working, return it to service.
 
## **Backup Requests**
 
Another technique for minimizing latency variability is the use of backup requests, where a client sends a copy of the same request to two or more server clusters in order to improve latency. If one of the servers selected returns the data faster, the client sends a cancellation request for the duplicate request in the other server queue, if it’s possible to maintain information about where the original request was sent. However, the disadvantage of this is that it can double the processing required if two servers begin processing the request at about the same time. In the latter case, the client needs to check if the issue of simultaneous processing occurred, and if so, to send only one copy of the requested data. 
 
The speaker then measured the improvement in latency using two different systems. The first was a loaded server cluster where data was replicated in two in-memory servers, and 1000 requests were spread across 100 tablets. The speaker measured the time it took for all 1000 keys to be retrieved. The second measurement used an almost completely idle system, but with the same parameters: data was replicated in two in-memory servers, 1000 requests were sent across 100 tablets, and the total retrieval time for all 1000 keys was measured.  In both cases, backup requests reduced latency dramatically. The results for both loaded and idle servers, respectively, are shown in the tables below:
 
### Loaded cluster results:
 
| Policy        | Avg      | Std Dev | 95%ile | 99%ile | 99.9%ile |
| ------------- |:--------:|:-------:|:------:|:------:|:--------:|
| No backups    | 33 ms   | 1524 ms| 24 ms  | 52 ms  | 994 ms   |
| Backup after 10 ms | 14 ms   | 4 ms    | 20 ms  | 23 ms  | 50 ms    |
| Backup after 50 ms | 16 ms   | 12 ms   | 57 ms  | 63 ms  | 68 ms    |
 
### Idle cluster results:
 
| Policy        | 50%ile  | 90%ile | 99%ile | 99.9%ile |
| ------------- |:--------:|:-------:|:------:|:--------:|
| No backups    | 19 ms   | 38 ms  | 67 ms  | 98 ms    |
| Backup after 2 ms | 16 ms   | 28 ms  | 38 ms  | 51 ms    |
 
## Conclusion
 
These techniques can make online services more responsive and can dramatically cut down on processing time and costs.

Prompt in Google AI Studio
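The 63% figure in the lecture notes follows from simple probability. If each server independently has a 1% chance of responding slowly, the chance that at least one of 100 servers is slow is 1 - 0.99^100. The sketch below computes this; independence between servers is an assumption made for the arithmetic, and `fraction_slow` is a hypothetical helper name.

```python
def fraction_slow(p_slow_single: float, fanout: int) -> float:
    """Probability that at least one of `fanout` servers responds slowly.

    Assumes each server is slow independently with probability p_slow_single.
    """
    return 1.0 - (1.0 - p_slow_single) ** fanout


# One server: about 1% of requests exceed the 99th-percentile latency.
single = fraction_slow(0.01, 1)

# Fan-out of 100 servers: roughly 63% of requests wait on a slow server.
fanned_out = fraction_slow(0.01, 100)

print(f"{single:.1%} vs {fanned_out:.1%}")
```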


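# 7. Video information extraction

Gemini can extract information from videos and output it in structured formats such as lists, tables, and JSON objects. This is useful for detecting entities in domains such as retail, traffic, and home security, for extracting unstructured data from screen recordings, and for cataloging.

Prompt in Google AI Studio

Note: Because the current sampling rate is 1 FPS, the model may occasionally miss some content in a video. We are working to enable higher frame-rate sampling for video soon. For now, we recommend verifying the output for these use cases where needed. We still wanted to give you an early look at the direction we are working toward and at how far our models will progress in the coming months.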
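To see why 1 FPS sampling can miss content, consider how many sampled frames fall inside an event's time window. The sketch below assumes frames are taken at evenly spaced instants starting at t = 0 (0 s, 1 s, 2 s, and so on at 1 FPS); the actual sampling offsets are not described in this post, and `sampled_frame_count` is a hypothetical helper name.

```python
import math


def sampled_frame_count(start_s: float, end_s: float, fps: float = 1.0) -> int:
    """Count sample instants k / fps (k = 0, 1, 2, ...) within [start_s, end_s].

    Assumes start_s >= 0 and that sampling starts at t = 0.
    """
    first = math.ceil(start_s * fps)   # index of the first sample at or after start
    last = math.floor(end_s * fps)     # index of the last sample at or before end
    return max(0, last - first + 1)


# An event visible from 2.2 s to 2.8 s falls between two samples.
brief_event = sampled_frame_count(2.2, 2.8)

# An event visible from 2.2 s to 4.5 s is caught by the samples at 3 s and 4 s.
longer_event = sampled_frame_count(2.2, 4.5)

print(brief_event, longer_event)
```

Events shorter than the sampling interval can land entirely between two frames, which is one reason to verify outputs for use cases that depend on brief on-screen content.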


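Resources

To start building with vision in the Gemini API, check out our developer guide. You can also join our developer forum to meet other developers, discuss your use cases, and get help from members of the Gemini API team.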