Unlock Gemini’s reasoning: A step-by-step guide to logprobs on Vertex AI

JULY 16, 2025

Eric Dong Developer Advocate

Ever wished you could see why the Gemini model chose one word over another? For developers building sophisticated applications, understanding a model's decision-making process is crucial. Now you can. The model's reasoning is no longer a black box. The logprobs feature has been officially introduced in the Gemini API on Vertex AI, unlocking a deeper view into the model's choices.

This feature allows you to see the probability scores for the chosen token and its top alternatives, giving you unprecedented insight into the model's reasoning. This is more than just a debugging tool; it's a way to build smarter, more reliable, and context-aware applications.

This blog post will walk you step-by-step through the "Intro to Logprobs" notebook, showing you how to enable this feature and apply it to powerful use cases like confident classification, dynamic autocomplete, and quantitative RAG evaluation.

Getting started: Setting up your environment

First, let's get your environment ready to follow along.

Install the SDK: You'll need the Google GenAI SDK for Python. If you don't have it, install or upgrade it with the following command:

%pip install -U -q google-genai

Python

2. Configure your project: Set up your Google Cloud Project ID and location. This tutorial uses these credentials to authenticate your environment.

PROJECT_ID = "[your-project-id]"

Python

3. Initialize the client and model: Create a client to interact with the Vertex AI API and specify the Gemini model you want to use.

from google import genai
client = genai.Client(vertexai=True, project=PROJECT_ID, location="global")
MODEL_ID = "gemini-2.5-flash"

Python

What exactly are logprobs?

Before enabling the feature, let's clarify what a log probability is. A logprob is the natural logarithm of the probability score the model assigned to a token.

Here are the key concepts:

A probability is always a value between 0 and 1.

The natural logarithm of any number between 0 and 1 is always a negative number.

The natural logarithm of 1 (representing 100% certainty) is 0.

Therefore, a logprob score closer to 0 indicates higher model confidence in its choice.

Enabling and processing logprobs

You can enable log probabilities by setting two parameters in your request's generation_config.

response_logprobs=True: This tells the model to return the log probabilities of the tokens it chose for its output. Its default is False.

logprobs=[integer]: This asks the model to also return the log probabilities for a specified number of the top alternative tokens at each step. The accepted value is between 1 and 20.

Let's see this in action with a simple classification prompt. We'll ask the model to classify a sentence as "Positive," "Negative," or "Neutral".

from google.genai.types import GenerateContentConfig
prompt = "I am not sure if I really like this restaurant a lot."
response_schema = {"type": "STRING", "enum": ["Positive", "Negative", "Neutral"]}

response = client.models.generate_content(
    model=MODEL_ID,
    contents=prompt,
    generation_config=GenerateContentConfig(
        response_mime_type="application/json",              
        response_schema=response_schema,
        response_logprobs=True,
        logprobs=3,
    ),
)

Python

Processing the logprobs result

The logprobs data is returned within the response object. You can use a helper function to print the log probabilities in a readable format. The following function provides a detailed look at the model's predictions for each token in the generated response.

def print_logprobs(response):
 """
 Print log probabilities for each token in the response
 """
 if response.candidates and response.candidates[0].logprobs_result:
   logprobs_result = response.candidates[0].logprobs_result
   for i, chosen_candidate in enumerate(logprobs_result.chosen_candidates):
     print(
       f"Token: '{chosen_candidate.token}' ({chosen_candidate.log_probability:.4f})"
       )
     if i < len(logprobs_result.top_candidates):
       top_alternatives = logprobs_result.top_candidates[i].candidates
       alternatives = [
         alt
         for alt in top_alternatives
         if alt.token != chosen_candidate.token
       ]
       if alternatives:
         print("Alternative Tokens:")
         for alt_token_info in alternatives:
           print(
           f"  - '{alt_token_info.token}': ({alt_token_info.log_probability:.4f})"
           )
     print("-" * 20)

Python

Interpreting the output

Let's say the model returned the following:

Token: 'Neutral' (-0.0214)

2. Alternative Tokens:

'Positive': (-4.8219)

'Negative': (-5.6293)

This output shows the model is highly confident in its choice of "Neutral," as its log probability (-0.0214) is very close to 0. The alternatives, "Positive" and "Negative," have significantly lower (more negative) log probabilities, indicating the model considered them much less likely.

Use case 1: Smarter classification

Logprobs transform classification from a simple answer into a transparent decision, allowing you to build more robust systems.

Detecting ambiguity for human review

Scenario: You want to automate a classification task but need to flag ambiguous cases where the model isn't confident.

Why Logprobs?: A small difference between the top two log probabilities is a signal of ambiguity. You can check this difference against a margin to decide if human review is needed.

The check_for_ambiguity function in the notebook calculates the absolute difference between the top and second choice's logprobs. If this difference is less than a predefined ambiguity_margin, it flags the result as ambiguous.

Confidence-based thresholding

Scenario: For critical applications, you only want to accept a classification if the model's confidence exceeds a certain level, like 90%.

Why Logprobs?: Logprobs give you a direct measure of the model's confidence. You can convert the log probability of the chosen token back into a raw probability score to apply a threshold.

The accept_if_confident function demonstrates this by using math.exp() to convert the logprob to a probability percentage and then checks if it's above a threshold.

Use case 2: Dynamic auto-complete

Scenario: You want to build an auto-complete feature that suggests the most likely next words as a user types.

Why Logprobs?: By repeatedly querying the model with the growing text and examining the logprobs of the top candidates, you can observe the model's shifting predictions and offer relevant, real-time suggestions.

The notebook simulates a user typing "The best thing about living in Toronto is the" word by word . As the context grows, the suggestions for the next word become more specific and accurate:

Analysis:

Notice how the predictions evolve. For "The", the suggestions are generic starting words.

With the full context of "The best thing about living in Toronto is the", the model confidently predicts specific Toronto attributes like "diversity" and "food," showcasing its deep contextual understanding.

This shows how logprobs give you a window into the model's contextual understanding, allowing you to build more adaptive and context-aware features.

Use case 3: Quantitative RAG evaluation

Scenario: You have a RAG system and need to evaluate how well its answers are supported by the retrieved context.

Why Logprobs?: When an LLM has relevant context, its confidence in generating a factually consistent answer increases, which is reflected in higher log probabilities. You can calculate an average log probability for the generated answer to get a "grounding" or "confidence" score.

The notebook sets up a fictional knowledge base and tests three scenarios: good retrieval, poor retrieval, and no retrieval.

Analysis:

The results show a clear correlation. The "Good Retrieval" scenario has the highest score (closest to zero), as the model is very confident its answer is supported by the text. This makes logprobs a powerful metric for automating the evaluation and improvement of your RAG systems.

What's next?

You've now learned how to use the logprobs feature to gain deeper insight into the model's decision-making process. We've walked through how to apply these insights to analyze classification results, build dynamic autocomplete features, and evaluate RAG systems.

These are just a few of the example applications, and with this feature, there are many more use cases for developers to explore.

To dive deeper, explore the full notebook and consult the official documentation: