Announcing the general availability of Llama 4 MaaS on Vertex AI

April 29, 2025
Ivan Nardini, AI/ML Advocate, Cloud AI

Deploying and managing Llama 4 models involves multiple steps: navigating complex infrastructure setup, managing GPU availability, ensuring scalability, and handling ongoing operational overhead. What if you could address these challenges and focus directly on building your applications? It’s possible with Vertex AI.

We're thrilled to announce that Llama 4, the latest generation of Meta’s open large language models, is now generally available (GA) as a fully managed API endpoint in Vertex AI! In addition to Llama 4, we’re also announcing the general availability of the Llama 3.3 70B managed API in Vertex AI.

Llama 4 reaches new performance peaks compared to previous Llama models, with multimodal capabilities and a highly efficient Mixture-of-Experts (MoE) architecture. Llama 4 Scout outperforms all previous generations of Llama models, delivers significant efficiency on multimodal tasks, and is optimized to run in a single-GPU environment. Llama 4 Maverick is the most intelligent model Meta offers today, designed for reasoning, complex image understanding, and demanding generative tasks.

With Llama 4 as a fully managed API endpoint, you can now leverage Llama 4's advanced reasoning, coding, and instruction-following capabilities with the ease, scalability, and reliability of Vertex AI to build more sophisticated and impactful AI-powered applications.

This post will guide you through getting started with Llama 4 as a Model-as-a-Service (MaaS), highlight the key benefits, show you how simple it is to use, and touch upon cost considerations.


Discover Llama 4 MaaS in Vertex AI Model Garden

Vertex AI Model Garden is your central hub for discovering and deploying foundation models on Google Cloud via managed APIs. It offers a curated selection of Google's own models (like Gemini), open-source models, and third-party models — all accessible through simplified interfaces. The addition of Llama 4 (GA) as a managed service expands this selection, offering you more flexibility.

Llama 4 MaaS in Vertex AI Model Garden

Accessing Llama 4 as a Model-as-a-Service (MaaS) on Vertex AI has the following advantages:

1: Zero infrastructure management: Google Cloud handles the underlying infrastructure, GPU provisioning, software dependencies, patching, and maintenance. You interact with a simple API endpoint.

2: Guaranteed performance with provisioned throughput: Reserve dedicated processing capacity for your models at a fixed fee, ensuring high availability and prioritized processing for your requests, even during periods of peak demand.

3: Enterprise-grade security and compliance: Benefit from Google Cloud's robust security, data encryption, access controls, and compliance certifications.


Getting started with Llama 4 MaaS

Getting started with Llama 4 MaaS on Vertex AI requires only one step: navigate to the Llama 4 model card in the Vertex AI Model Garden and accept the Llama Community License Agreement. You cannot call the API until you have completed this step.

Once you have accepted the agreement, find the specific Llama 4 MaaS model you wish to use in the Model Garden (e.g., "Llama 4 17B Instruct MaaS") and take note of its unique Model ID (like meta/llama-4-scout-17b-16e-instruct-maas); you'll need this ID when calling the API.

Then you can call the Llama 4 MaaS endpoint directly using the ChatCompletion API. There's no separate "deploy" step for the MaaS offering – Google Cloud manages the endpoint provisioning. Below is an example of calling Llama 4 Scout through the ChatCompletion API in Python.

import openai
import google.auth
import google.auth.transport.requests

# --- Configuration ---
PROJECT_ID = "<YOUR_PROJECT_ID>" 
LOCATION = "us-east5"
MODEL_ID = "meta/llama-4-scout-17b-16e-instruct-maas" 

# Obtain an Application Default Credentials (ADC) access token
credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
credentials.refresh(google.auth.transport.requests.Request())
gcp_token = credentials.token

# Construct the Vertex AI MaaS endpoint URL for OpenAI library
vertex_ai_endpoint_url = (
    f"https://{LOCATION}-aiplatform.googleapis.com/v1beta1/"
    f"projects/{PROJECT_ID}/locations/{LOCATION}/endpoints/openapi"
)

# Initialize the client to use ChatCompletion API pointing to Vertex AI MaaS
client = openai.OpenAI(
    base_url=vertex_ai_endpoint_url,
    api_key=gcp_token,  # Use the GCP access token as the API key
)

# Example: Multimodal request (text + image from Cloud Storage)
prompt_text = "Describe this landmark and its significance."
image_gcs_uri = "gs://cloud-samples-data/vision/landmark/eiffel_tower.jpg"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {"url": image_gcs_uri},
            },
            {"type": "text", "text": prompt_text},
        ],
    }
]

# Optional parameters (refer to model card for specifics)
max_tokens_to_generate = 1024
request_temperature = 0.7
request_top_p = 1.0

# Call the ChatCompletion API
response = client.chat.completions.create(
    model=MODEL_ID,  # The Llama 4 MaaS model ID
    messages=messages,
    max_tokens=max_tokens_to_generate,
    temperature=request_temperature,
    top_p=request_top_p,
    # stream=True,  # Enable for streaming responses (see sketch below)
)

generated_text = response.choices[0].message.content
print(generated_text)
# The image contains...
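
The stream parameter noted above enables streaming responses. Below is a minimal sketch of the streaming variant, following the standard OpenAI Python SDK streaming pattern and reusing the client and MODEL_ID from the example above:

# Streaming variant: print tokens as they arrive instead of waiting
# for the full completion.
stream = client.chat.completions.create(
    model=MODEL_ID,
    messages=[{"role": "user", "content": "Summarize the benefits of a Mixture-of-Experts architecture."}],
    max_tokens=256,
    stream=True,
)

for chunk in stream:
    # Each chunk carries a delta with the next piece of generated text
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)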

Important: Always consult the specific Llama 4 model card in Vertex AI Model Garden. It contains crucial information about:

  • The exact input/output schema expected by the model.

  • Supported parameters (like temperature, top_p, max_tokens) and their valid ranges.

  • Any specific formatting requirements for prompts or multimodal inputs.


Cost and quota considerations

Llama 4 as a Model-as-a-Service on Vertex AI operates on a predictable model that combines pay-as-you-go pricing with usage quotas. Understanding both the pricing structure and your service quotas is essential for managing costs and scaling your application effectively.

With pay-as-you-go pricing, you pay only for the prediction requests you make. The underlying infrastructure, scaling, and management costs are incorporated into the API usage price. Refer to the Vertex AI pricing page for details.
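
As a back-of-the-envelope illustration, the sketch below estimates per-request cost from the token counts the API reports in the response's usage field. The prices are placeholders, not actual rates; substitute the current values from the Vertex AI pricing page:

# Placeholder prices per 1M tokens -- NOT actual rates; check the
# Vertex AI pricing page for current Llama 4 MaaS pricing.
PRICE_PER_1M_INPUT_TOKENS = 0.25   # USD, hypothetical
PRICE_PER_1M_OUTPUT_TOKENS = 0.70  # USD, hypothetical

def estimate_request_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate the USD cost of a single request from its token counts."""
    return (
        prompt_tokens / 1_000_000 * PRICE_PER_1M_INPUT_TOKENS
        + completion_tokens / 1_000_000 * PRICE_PER_1M_OUTPUT_TOKENS
    )

# Reusing the response object from the earlier example
usage = response.usage
print(f"~${estimate_request_cost(usage.prompt_tokens, usage.completion_tokens):.6f}")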

To ensure service stability and fair usage, your use of Llama 4 as a Model-as-a-Service on Vertex AI is subject to quotas. These limit factors such as the number of requests per minute (RPM) your project can make to the specific model endpoint. Refer to our quota documentation for more details.
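
If a request exceeds your RPM quota, the endpoint returns a rate-limit error. A common client-side pattern is to retry with exponential backoff; here is a minimal sketch using the OpenAI SDK's RateLimitError, reusing the client, MODEL_ID, and messages from the earlier example:

import time

def chat_with_backoff(client, max_retries=5, **kwargs):
    """Call the ChatCompletion API, retrying on rate-limit errors
    with exponential backoff (1s, 2s, 4s, ...)."""
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(**kwargs)
        except openai.RateLimitError:
            if attempt == max_retries - 1:
                raise  # Give up after the final attempt
            time.sleep(2 ** attempt)

response = chat_with_backoff(client, model=MODEL_ID, messages=messages)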


What’s next

With Llama 4 now generally available as a Model-as-a-Service on Vertex AI, you can leverage one of the most advanced open LLMs without managing any of the underlying infrastructure.


We are excited to see what applications you will build with Llama 4 on Vertex AI. Share your feedback and experiences through our Google Cloud community forum.