Announcing the general availability of Llama 4 MaaS on Vertex AI

April 29, 2025
Ivan Nardini, AI/ML Advocate, Cloud AI

Deploying and managing Llama 4 models involves multiple steps: navigating complex infrastructure setup, managing GPU availability, ensuring scalability, and handling ongoing operational overhead. What if you could address these challenges and focus directly on building your applications? It’s possible with Vertex AI.

We're thrilled to announce that Llama 4, the latest generation of Meta’s open large language models, is now generally available (GA) as a fully managed API endpoint in Vertex AI! In addition to Llama 4, we’re also announcing the general availability of the Llama 3.3 70B managed API in Vertex AI.

Llama 4 reaches new performance peaks compared to previous Llama models, with multimodal capabilities and a highly efficient Mixture-of-Experts (MoE) architecture. Llama 4 Scout outperforms all previous generations of Llama models, delivers significant efficiency on multimodal tasks, and is optimized to run in a single-GPU environment. Llama 4 Maverick is the most intelligent model Meta offers today, designed for reasoning, complex image understanding, and demanding generative tasks.

With Llama 4 as a fully managed API endpoint, you can now leverage Llama 4's advanced reasoning, coding, and instruction-following capabilities with the ease, scalability, and reliability of Vertex AI to build more sophisticated and impactful AI-powered applications.

This post will guide you through getting started with Llama 4 as a Model-as-a-Service (MaaS), highlight the key benefits, show you how simple it is to use, and touch upon cost considerations.


Discover Llama 4 MaaS in Vertex AI Model Garden

Vertex AI Model Garden is your central hub for discovering and deploying foundation models on Google Cloud via managed APIs. It offers a curated selection of Google's own models (like Gemini), open-source models, and third-party models — all accessible through simplified interfaces. The addition of Llama 4 (GA) as a managed service expands this selection, offering you more flexibility.

Llama 4 MaaS in Vertex AI Model Garden

Accessing Llama 4 as a Model-as-a-Service (MaaS) on Vertex AI has the following advantages:

1: Zero infrastructure management: Google Cloud handles the underlying infrastructure, GPU provisioning, software dependencies, patching, and maintenance. You interact with a simple API endpoint.

2: Guaranteed performance with provisioned throughput: Reserve dedicated processing capacity for your models at a fixed fee, ensuring high availability and prioritized processing for your requests, even during periods of peak demand.

3: Enterprise-grade security and compliance: Benefit from Google Cloud's robust security, data encryption, access controls, and compliance certifications.


Getting started with Llama 4 MaaS

Getting started with Llama 4 MaaS on Vertex AI requires only one step: navigate to the Llama 4 model card in the Vertex AI Model Garden and accept the Llama Community License Agreement. You cannot call the API until you have completed this step.

Once you have accepted the agreement, find the specific Llama 4 MaaS model you wish to use in the Model Garden (e.g., "Llama 4 17B Instruct MaaS") and take note of its unique Model ID (like meta/llama-4-scout-17b-16e-instruct-maas); you'll need this ID when calling the API.

Then you can call the Llama 4 MaaS endpoint directly using the ChatCompletion API. There's no separate "deploy" step for the MaaS offering – Google Cloud manages the endpoint provisioning. Below is an example of calling Llama 4 Scout through the ChatCompletion API in Python.

import openai
import google.auth
import google.auth.transport.requests

# --- Configuration ---
PROJECT_ID = "<YOUR_PROJECT_ID>" 
LOCATION = "us-east5"
MODEL_ID = "meta/llama-4-scout-17b-16e-instruct-maas" 

# Obtain an Application Default Credentials (ADC) access token
credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
credentials.refresh(google.auth.transport.requests.Request())
gcp_token = credentials.token

# Construct the Vertex AI MaaS endpoint URL for OpenAI library
vertex_ai_endpoint_url = (
    f"https://{LOCATION}-aiplatform.googleapis.com/v1beta1/"
    f"projects/{PROJECT_ID}/locations/{LOCATION}/endpoints/openapi"
)

# Initialize the client to use ChatCompletion API pointing to Vertex AI MaaS
client = openai.OpenAI(
    base_url=vertex_ai_endpoint_url,
    api_key=gcp_token,  # Use the GCP access token as the API key
)

# Example: Multimodal request (text + image from Cloud Storage)
prompt_text = "Describe this landmark and its significance."
image_gcs_uri = "gs://cloud-samples-data/vision/landmark/eiffel_tower.jpg"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {"url": image_gcs_uri},
            },
            {"type": "text", "text": prompt_text},
        ],
    }
]

# Optional parameters (refer to model card for specifics)
max_tokens_to_generate = 1024
request_temperature = 0.7
request_top_p = 1.0

# Call the ChatCompletion API
response = client.chat.completions.create(
    model=MODEL_ID,  # The Llama 4 MaaS model ID
    messages=messages,
    max_tokens=max_tokens_to_generate,
    temperature=request_temperature,
    top_p=request_top_p,
    # stream=True,  # Enable for streaming responses (see sketch below)
)

generated_text = response.choices[0].message.content
print(generated_text)
# The image contains...
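
The stream parameter noted above enables streaming responses. Below is a minimal sketch of the streaming variant, following the standard OpenAI Python SDK streaming pattern and reusing the client and MODEL_ID from the example above:

# Streaming variant: print tokens as they arrive instead of waiting
# for the full completion.
stream = client.chat.completions.create(
    model=MODEL_ID,
    messages=[{"role": "user", "content": "Summarize the benefits of a Mixture-of-Experts architecture."}],
    max_tokens=256,
    stream=True,
)

for chunk in stream:
    # Each chunk carries a delta with the next piece of generated text
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)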

Important: Always consult the specific Llama 4 model card in Vertex AI Model Garden. It contains crucial information about:

  • The exact input/output schema expected by the model.

  • Supported parameters (like temperature, top_p, max_tokens) and their valid ranges.

  • Any specific formatting requirements for prompts or multimodal inputs.


Cost and quota considerations

Llama 4 as a Model-as-a-Service on Vertex AI operates on a predictable model that combines pay-as-you-go pricing with usage quotas. Understanding both the pricing structure and your service quotas is essential for managing costs and scaling your application effectively.

With pay-as-you-go pricing, you pay only for the prediction requests you make. The underlying infrastructure, scaling, and management costs are incorporated into the API usage price. Refer to the Vertex AI pricing page for details.
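
As a back-of-the-envelope illustration, the sketch below estimates per-request cost from the token counts the API reports in the response's usage field. The prices are placeholders, not actual rates; substitute the current values from the Vertex AI pricing page:

# Placeholder prices per 1M tokens -- NOT actual rates; check the
# Vertex AI pricing page for current Llama 4 MaaS pricing.
PRICE_PER_1M_INPUT_TOKENS = 0.25   # USD, hypothetical
PRICE_PER_1M_OUTPUT_TOKENS = 0.70  # USD, hypothetical

def estimate_request_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate the USD cost of a single request from its token counts."""
    return (
        prompt_tokens / 1_000_000 * PRICE_PER_1M_INPUT_TOKENS
        + completion_tokens / 1_000_000 * PRICE_PER_1M_OUTPUT_TOKENS
    )

# Reusing the response object from the earlier example
usage = response.usage
print(f"~${estimate_request_cost(usage.prompt_tokens, usage.completion_tokens):.6f}")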

To ensure service stability and fair usage, your use of Llama 4 as a Model-as-a-Service on Vertex AI is subject to quotas. These limit factors such as the number of requests per minute (RPM) your project can make to the specific model endpoint. Refer to our quota documentation for more details.
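
If a request exceeds your RPM quota, the endpoint returns a rate-limit error. A common client-side pattern is to retry with exponential backoff; here is a minimal sketch using the OpenAI SDK's RateLimitError, reusing the client, MODEL_ID, and messages from the earlier example:

import time

def chat_with_backoff(client, max_retries=5, **kwargs):
    """Call the ChatCompletion API, retrying on rate-limit errors
    with exponential backoff (1s, 2s, 4s, ...)."""
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(**kwargs)
        except openai.RateLimitError:
            if attempt == max_retries - 1:
                raise  # Give up after the final attempt
            time.sleep(2 ** attempt)

response = chat_with_backoff(client, model=MODEL_ID, messages=messages)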


What’s next

With Llama 4 now generally available as a Model-as-a-Service on Vertex AI, you can leverage one of the most advanced open LLMs without managing any of the underlying infrastructure.


We are excited to see what applications you will build with Llama 4 on Vertex AI. Share your feedback and experiences through our Google Cloud community forum.