The latest generation of Gemini models, 2.5 Pro and Flash, are unlocking new frontiers in robotics. Their advanced coding, reasoning, and multimodal capabilities, now combined with spatial understanding, provide the foundation for the next generation of interactive and intelligent robots.
This post explores how developers can leverage Gemini 2.5 to build sophisticated robotics applications. We'll provide practical examples with prompts that show how to use Gemini 2.5 and the Live API for semantic scene understanding (pointing, object detection, and tracking), spatial reasoning combined with code generation for robot control, and interactive robotics applications built with voice interaction and function calling.
In March, we launched our Gemini Robotics models, including Gemini Robotics-ER, our advanced embodied reasoning model optimized for the unique demands of robotics applications. We’re also excited to share how our Gemini Robotics trusted testers are already demonstrating the power of Gemini in robotics applications. We are including examples from Agile Robots, Agility Robotics, Boston Dynamics, and Enchanted Tools. Join the Gemini Robotics-ER trusted tester program waitlist.
Reasoning about the physical world is at the core of general and robust control. Gemini 2.5 represents a step in this direction with its improved ability to reason multimodally. Below we share two examples, utilizing Gemini’s pointing and object detection capabilities.
Pointing allows a model to refer to entities or parts of entities precisely and to locate them in space. Gemini 2.5 Pro is able to reason about the entities it is pointing to, opening new opportunities for interacting with images. For example, Gemini 2.5 Pro can reason about empty space in the context of a supermarket display, knowing that this indicates restocking may be needed. In the example below, Gemini identifies that the baby eggplant bin needs restocking. Gemini 2.5 Pro also shows a nascent ability to locate small objects and read information from them, as illustrated in the gauge example.
Example 1: Gemini 2.5 can locate objects in the scene based on fine-grained language descriptions, for example, finding a bin on a shelf that needs restocking.
Prompt: Point to one bin on the shelf that needs restocking. The answer should follow the json format: [{"point": <point>, "label": <label1>}, ...]. The points are in [y, x] format normalized to 0-1000.
Input image with response overlay:
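To try the pointing prompt programmatically, the sketch below shows one way to send it with the google-genai Python SDK and convert the returned points from the normalized 0-1000 [y, x] format back to pixel coordinates. The image path is a placeholder, and the fence-stripping step is only a precaution in case the model wraps its JSON answer in markdown.

import json
from PIL import Image
from google import genai

client = genai.Client()  # reads the API key from the environment
image = Image.open("shelf.jpg")  # placeholder path for the supermarket shelf image

prompt = (
    "Point to one bin on the shelf that needs restocking. "
    'The answer should follow the json format: [{"point": <point>, "label": <label1>}, ...]. '
    "The points are in [y, x] format normalized to 0-1000."
)

response = client.models.generate_content(model="gemini-2.5-pro", contents=[image, prompt])

# The model may wrap its JSON answer in a markdown code fence; strip it before parsing.
text = response.text.strip()
if text.startswith("```"):
    text = text.split("\n", 1)[1].rsplit("```", 1)[0]

width, height = image.size
for p in json.loads(text):
    y, x = p["point"]  # normalized [y, x] in 0-1000
    print(f"{p['label']}: pixel ({x / 1000 * width:.0f}, {y / 1000 * height:.0f})")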
Example 2: Gemini 2.5 can locate small objects in the scene and estimate states of those objects. For example, it can read gauges.
Prompt: Point to all the round gauges. The answer should follow the json format: [{"point": <point>, "label": <label1>}, ...]. The points are in [y, x] format normalized to 0-1000.
Input image with response overlay:
Prompt: What is the reading on the middle gauge?
Response: Based on the close-up view, the round gauge in the center-left of the image appears to be reading 0. The needle is pointing directly at the "0" mark on the dial.
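The two-step gauge interaction (first locate, then read) maps naturally onto a multi-turn chat, since the image stays in the conversation context. A minimal sketch, assuming the google-genai Python SDK and a placeholder image path:

from PIL import Image
from google import genai

client = genai.Client()
chat = client.chats.create(model="gemini-2.5-pro")
image = Image.open("gauges.jpg")  # placeholder path for the gauge image

# Turn 1: locate the gauges.
locate = chat.send_message([
    image,
    'Point to all the round gauges. The answer should follow the json format: '
    '[{"point": <point>, "label": <label1>}, ...]. '
    "The points are in [y, x] format normalized to 0-1000.",
])
print(locate.text)

# Turn 2: read one of them; the image from turn 1 is already in context.
reading = chat.send_message("What is the reading on the middle gauge?")
print(reading.text)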
Gemini 2.5 is able to accurately track multiple objects across time and to detect open-ended concepts like "a spill". It can also be prompted to predict trajectories in the form of a sequence of points.
Example 1: Gemini 2.5 can generate bounding boxes for objects in each frame of a video, which can be visualized as shown below.
Prompt: Detect green bowl, crab, wallet, pink bowl, phone, return a json array with keys box_2d and label. (executed per frame).
Input image with response overlay:
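To reproduce the per-frame detection, you can sample frames from a clip and run the same prompt on each one. The sketch below assumes the google-genai Python SDK and OpenCV, with scene.mp4 as a placeholder clip, and assumes boxes come back as [y_min, x_min, y_max, x_max] normalized to 0-1000, as in Gemini's other detection examples.

import json
import cv2
from PIL import Image
from google import genai

client = genai.Client()
PROMPT = ("Detect green bowl, crab, wallet, pink bowl, phone, "
          "return a json array with keys box_2d and label.")

def parse_json(text: str):
    # The model may wrap its answer in a markdown code fence; strip it before parsing.
    text = text.strip()
    if text.startswith("```"):
        text = text.split("\n", 1)[1].rsplit("```", 1)[0]
    return json.loads(text)

cap = cv2.VideoCapture("scene.mp4")  # placeholder clip
frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if frame_idx % 15 == 0:  # sample a subset of frames to limit API calls
        rgb = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        detections = parse_json(
            client.models.generate_content(model="gemini-2.5-flash",
                                           contents=[rgb, PROMPT]).text)
        h, w = frame.shape[:2]
        for det in detections:
            y0, x0, y1, x1 = det["box_2d"]  # assumed normalized to 0-1000
            cv2.rectangle(frame, (int(x0 / 1000 * w), int(y0 / 1000 * h)),
                          (int(x1 / 1000 * w), int(y1 / 1000 * h)), (0, 255, 0), 2)
            cv2.putText(frame, det["label"], (int(x0 / 1000 * w), int(y0 / 1000 * h) - 5),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
        cv2.imwrite(f"frame_{frame_idx:05d}.jpg", frame)
    frame_idx += 1
cap.release()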
Example 2: Gemini 2.5 can detect open-ended concepts relevant to robotics that require commonsense knowledge and context-specific reasoning. For example, a helpful robot needs to understand the concept of a "spill".
Prompt:
1) Show me the bounding box of spill. Return in a json array with keys box_2d and label.
2) Give the segmentation masks for the spill. Output a JSON list of segmentation masks where each entry contains the 2D bounding box in the key "box_2d", the segmentation mask in key "mask", and the text label in the key "label".
Input image with response overlay:
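Parsing the segmentation output takes one extra step beyond box parsing. The sketch below assumes the google-genai Python SDK, a placeholder image path, and that each "mask" value is a base64-encoded PNG cropped to its bounding box, as in Gemini's public segmentation examples; check the current API docs if your output differs.

import base64
import io
import json
import numpy as np
from PIL import Image
from google import genai

client = genai.Client()
image = Image.open("kitchen.jpg")  # placeholder path for the spill scene

prompt = ('Give the segmentation masks for the spill. Output a JSON list of segmentation '
          'masks where each entry contains the 2D bounding box in the key "box_2d", the '
          'segmentation mask in key "mask", and the text label in the key "label".')

text = client.models.generate_content(model="gemini-2.5-flash",
                                      contents=[image, prompt]).text.strip()
if text.startswith("```"):
    text = text.split("\n", 1)[1].rsplit("```", 1)[0]

width, height = image.size
for entry in json.loads(text):
    y0, x0, y1, x1 = entry["box_2d"]  # assumed normalized to 0-1000
    box = (int(x0 / 1000 * width), int(y0 / 1000 * height),
           int(x1 / 1000 * width), int(y1 / 1000 * height))
    # Assumption: the mask is a base64-encoded PNG, possibly prefixed with a data-URI header.
    png_bytes = base64.b64decode(entry["mask"].split(",")[-1])
    mask = Image.open(io.BytesIO(png_bytes)).resize((box[2] - box[0], box[3] - box[1]))
    binary = np.array(mask) > 127  # threshold the probability mask
    print(f"{entry['label']}: {int(binary.sum())} mask pixels inside box {box}")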
Example 3: Gemini 2.5 can be prompted to predict a trajectory in the form of a sequence of points.
Prompt: Generate a robot arm trajectory of 10 points to move the cloth to the spill. The answer should follow the json format: [{"point": <point>, "label": <label1>}, ...]. The points are in [y, x] format normalized to 0-1000.
Input image with response overlay:
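The predicted trajectory is just a list of normalized image points, so turning it into something a robot can follow means rescaling to pixels and then mapping into the robot's workspace with your own camera calibration. A minimal sketch, assuming the google-genai Python SDK; pixel_to_world is a hypothetical calibration function you would supply.

import json
from PIL import Image
from google import genai

client = genai.Client()
image = Image.open("table.jpg")  # placeholder path for the cloth-and-spill scene

prompt = ("Generate a robot arm trajectory of 10 points to move the cloth to the spill. "
          'The answer should follow the json format: [{"point": <point>, "label": <label1>}, ...]. '
          "The points are in [y, x] format normalized to 0-1000.")

text = client.models.generate_content(model="gemini-2.5-pro",
                                      contents=[image, prompt]).text.strip()
if text.startswith("```"):
    text = text.split("\n", 1)[1].rsplit("```", 1)[0]

waypoints = json.loads(text)
width, height = image.size
pixel_path = [(p["point"][1] / 1000 * width, p["point"][0] / 1000 * height) for p in waypoints]

for i, (x, y) in enumerate(pixel_path):
    print(f"waypoint {i} ({waypoints[i]['label']}): pixel ({x:.0f}, {y:.0f})")
    # world_xyz = pixel_to_world(x, y)  # hypothetical camera-to-robot calibration step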
Gemini 2.5 can utilize its underlying spatial understanding to control robots through code generation. By providing Gemini 2.5 with a robot control API, it can apply advanced capabilities in scene understanding, object manipulation, and code writing together to perform tasks zero-shot, with no additional training.
Example 1 below showcases code generation for "Put the banana in the bowl". It gives Gemini access to a robot control API and shows how the model leverages its spatial understanding, thinking, and code generation capabilities to select the appropriate API calls and arguments for the task. Gemini 2.5 generates two different feasible plans for putting the banana in the bowl. The first simply picks up the banana, moves it above the bowl, and drops it. The second lifts the banana, moves the bowl below the banana, and then drops the banana.
Example 1: Gemini 2.5 generates high-level planning code for pick-and-place tasks with different strategies.
Prompt:
You are given a robot control API and example code below:
# Provided API
class RobotAPI:
    def detect_object(self, obj: str) -> Detection:
        """Detects the given object's XYZ location."""

    def get_grasp_position_and_euler_orientation(self, gripper: str, object_name: str) -> tuple[numpy.ndarray, numpy.ndarray]:
        """Returns the grasp position and orientation for the given object and gripper from the Gemini Robotics-ER model or other grasping primitives."""

    def move_gripper_to(self, gripper: str, position: numpy.ndarray, orientation: numpy.ndarray):
        """Moves the gripper to the given position and orientation."""

    def move_gripper_to_safe_position(self, gripper: str):
        """Moves the given gripper to a safe position out of the table area."""

    def open_gripper(self, gripper: str):
        """Opens the given gripper."""

    def close_gripper(self, gripper: str):
        """Closes the given gripper."""

robot = RobotAPI()

# Example code
robot.open_gripper(gripper='RIGHT')
banana_detection = robot.detect_object(obj='banana')
banana_position = banana_detection.position
print(f"Detected banana position: {banana_position}")  # Gemini uses this to determine which gripper to use.
grasp_position, grasp_orientation = robot.get_grasp_position_and_euler_orientation(gripper='RIGHT', object_name='banana')
pre_grasp_position = grasp_position + np.array([0, 0, 0.1])  # Add 10cm to the Z-coordinate for pre-grasp
robot.move_gripper_to(gripper='RIGHT', position=pre_grasp_position, orientation=grasp_orientation)
robot.move_gripper_to(gripper='RIGHT', position=grasp_position, orientation=grasp_orientation)
robot.close_gripper(gripper='RIGHT')
Now generate code for the task “put the banana in the bowl”.
Response 1 with results of code executed:
Reasoning (abbreviated)
1: The current position of the right gripper is [-0.04, -0.028, 0.176].
2: The bowl's detected position is [-0.03839983, -0.02811164, 0.0073104].
3: To place the banana inside the bowl, the gripper's Z-coordinate needs to be lowered. I will use the bowl's detected X and Y coordinates for the target position. For the Z-coordinate, I will use the bowl's detected Z-coordinate (0.0073104) and add a small offset (e.g., 0.03m) to ensure the banana is placed inside the bowl and not through the table. This would result in a target Z of approximately 0.007 + 0.03 = 0.037m. This Z-coordinate is higher than the bowl's base but lower than the typical height of a banana when standing on the table, ensuring it's placed *into* the bowl.
Code Snippet:
# Detect the bowl's position (re-detect to ensure fresh data, though it's unlikely to have moved)
bowl_detection = robot.detect_object('bowl')
if bowl_detection.position is None:
    raise ValueError("Bowl not detected.")
# Define the target Z-coordinate for dropping the banana into the bowl
# Use the bowl's Z-position and add an offset to place the banana inside.
# The banana's original Z was ~0.032, bowl's Z is ~0.007.
# A target Z of 0.037 (0.007 + 0.03) should place it inside the bowl.
drop_z = bowl_detection.position[2] + 0.03
Response 2 with results of code executed:
Reasoning (abbreviated)
# Re-planning due to reachability constraints: The RIGHT arm cannot reach the bowl's current position.
# The plan will now involve the LEFT arm moving the bowl to the RIGHT arm's reachable area.
# The RIGHT arm can reach x-coordinates from 0.04m to 0.40m. I will move the bowl to an x-coordinate of 0.10m, which is well within the RIGHT arm's reach and also allows for some clearance from the center.
Code Snippet:
target_bowl_position = [-0.07274145, -0.08317622, 0.00720479] # Original bowl position
# Move the bowl to an x-coordinate reachable by the RIGHT arm, keeping the same y and z (lifted)
# Let's choose x = 0.10m for the bowl, which is within the right arm's reach (0.04 to 0.40)
# Keep the lifted height at 0.20m
move_bowl_position = [0.10, bowl_detection.position[1], 0.20]
robot.move_gripper_to('LEFT', move_bowl_position, current_left_gripper_orientation)
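Tying the pieces together, one way to run this kind of zero-shot loop is to send the robot API description, the task, and a workspace image to Gemini, then execute the code it returns against your own RobotAPI implementation. The sketch below is illustrative rather than the setup used for the responses above: robot_api_prompt.txt, workspace.jpg, and the direct exec() call are all assumptions, and in practice you would review or sandbox generated code before running it on hardware.

import numpy as np
from PIL import Image
from google import genai

client = genai.Client()

# The RobotAPI class definition and example code from the prompt above, stored as text.
ROBOT_API_SPEC = open("robot_api_prompt.txt").read()  # placeholder file

def plan_and_execute(task: str, workspace_image: Image.Image, robot) -> None:
    prompt = f'{ROBOT_API_SPEC}\n\nNow generate code for the task "{task}".'
    response = client.models.generate_content(model="gemini-2.5-pro",
                                              contents=[workspace_image, prompt])
    code = response.text.strip()
    if code.startswith("```"):
        code = code.split("\n", 1)[1].rsplit("```", 1)[0]
    print("Generated plan:\n", code)
    # For illustration only: review or sandbox model-generated code before running it on a robot.
    exec(code, {"robot": robot, "np": np})

# plan_and_execute("put the banana in the bowl", Image.open("workspace.jpg"), robot=RobotAPI())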
Gemini 2.5 can also effectively utilize a small number of in-context examples to perform more dexterous robot control tasks. In the two examples shown below of an ALOHA robot packing boxes and folding a dress, 10 demonstrations consisting of interleaved reasoning and robot actions for each task were added to Gemini’s context. We’ve created open-source code showing how to do this using Gemini, including examples of the input demonstrations. This enables robots to be taught and deployed on the spot. See the Colab.
Example 2: Gemini 2.5 (Flash) utilizes a small number of in-context examples to perform more dexterous robot control tasks.
Prompt: see the Colab.
Response with results of code executed:
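The Colab shows the full recipe; the sketch below only illustrates the general shape of a few-shot request, assuming the google-genai Python SDK. The demonstration schema (frame, reasoning, and action fields) is hypothetical and is not the format used in the Colab.

from PIL import Image
from google import genai

client = genai.Client()

def build_few_shot_contents(demonstrations, current_frame):
    # Interleave each demonstration's camera frames with the reasoning and actions
    # recorded for them, then append the live observation for the model to continue.
    contents = []
    for demo in demonstrations:
        for step in demo:
            contents.append(step["frame"])  # a PIL.Image observation
            contents.append(f"Reasoning: {step['reasoning']}\nAction: {step['action']}")
    contents.append(current_frame)
    contents.append("Reasoning and next action:")
    return contents

# demonstrations = load_demonstrations()  # hypothetical loader for the 10 recorded demos
# response = client.models.generate_content(
#     model="gemini-2.5-flash",
#     contents=build_few_shot_contents(demonstrations, Image.open("current_view.jpg")),
# )
# print(response.text)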
The Live API for real-time streaming was recently introduced and can be used to build interactive applications that let people control robots using their voice. Intuitive human-robot interaction is an important aspect of making robots that are easy and safe to use. We recently showcased an interactive Gemini Robotics demo at I/O 2025, built around the Live API for voice interaction and function calling.
The Live API supports both audio and video as input modalities, and audio and text as output modalities. This allows you to send both voice input and the robot's camera feed to the Live API. It becomes even more powerful when combined with tool use.
Tool use allows the Live API to go beyond conversation by enabling it to perform actions in the real world while maintaining a real-time connection. For example, the robot APIs defined above, such as robot.open_gripper(), robot.close_gripper(), and robot.move_gripper_to(), can be exposed as function calls. Once they are defined as tool calls, they can be integrated into a workflow where people interact with the robot by voice in real time. Developers can get started on GitHub and refer to the API documentation for function calling features.
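As a concrete starting point, the sketch below wires two of the gripper functions into a Live API session as tool declarations, assuming the google-genai Python SDK; exact method names such as send_tool_response can vary between SDK versions, so check the current docs and the GitHub examples. The model name, parameter schemas, and dispatch logic are placeholders.

import asyncio
from google import genai
from google.genai import types

client = genai.Client()
MODEL_ID = "gemini-2.0-flash-live-001"  # example Live model; check the docs for current options

robot_tool = types.Tool(function_declarations=[
    {"name": "open_gripper", "description": "Opens the given gripper.",
     "parameters": {"type": "OBJECT",
                    "properties": {"gripper": {"type": "STRING"}},
                    "required": ["gripper"]}},
    {"name": "close_gripper", "description": "Closes the given gripper.",
     "parameters": {"type": "OBJECT",
                    "properties": {"gripper": {"type": "STRING"}},
                    "required": ["gripper"]}},
])

config = types.LiveConnectConfig(response_modalities=["AUDIO"], tools=[robot_tool])

async def run() -> None:
    async with client.aio.live.connect(model=MODEL_ID, config=config) as session:
        # Stream microphone audio and the robot camera feed into the session here (omitted).
        async for message in session.receive():
            if message.tool_call:
                results = []
                for call in message.tool_call.function_calls:
                    print("Robot call:", call.name, call.args)  # dispatch to the real robot here
                    results.append(types.FunctionResponse(
                        id=call.id, name=call.name, response={"status": "ok"}))
                await session.send_tool_response(function_responses=results)

asyncio.run(run())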
The 2.5 Pro and 2.5 Flash models demonstrate robust performance on the ASIMOV Multimodal and Physical Injury benchmarks released along with the Gemini Robotics tech report, exhibiting accuracy comparable to that of 2.0 models. Beyond the ASIMOV benchmarks, the 2.5 Pro and 2.5 Flash models also exhibit excellent performance in rejecting prompts that attempt to leverage embodied reasoning capabilities while violating safety policies such as promoting harmful stereotypes, discrimination, or endangerment of minors. Following rigorous evaluation against such synthetically generated adversarial prompts, 2.5 Pro and Flash demonstrated near-zero violation rates.
In March we released the Gemini Robotics-ER model and we’re already inspired by how the community is using it for robotics applications. Check out these examples of interactivity, perception, planning, and function calling from our trusted testers: Agile Robots, Agility Robotics, Boston Dynamics, and Enchanted Tools.
We can’t wait to see what you create.
Embodied reasoning in Gemini 2.5 Flash and Pro is available in Google AI Studio, the Gemini API, and Vertex AI. To start building with these models in the Gemini API, visit our developer guide. If you are interested in building with Gemini Robotics-ER, please sign up for the trusted tester program.
We thank the researchers on the Embodied Reasoning team: Alex Hofer, Annie Xie, Arunkumar Byravan, Ashwin Balakrishna, Assaf Hurwitz Michaely, Carolina Parada, David D'Ambrosio, Deepali Jain, Jacky Liang, Jie Tan, Junkyung Kim, Kanishka Rao, Keerthana Gopalakrishnan, Ksenia Konyushkova, Lewis Chiang, Marissa Giustina, Mohit Sharma, Montserrat Gonzalez Arenas, Nicolas Heess, Peng Xu, Pierre Sermanet, Sean Kirmani, Stefani Karp, Stefano Saliceti, Steven Hansen, Sudeep Dasari, Ted Xiao, Thomas Lampe, Tianli Ding, Wenhao Yu, and Wentao Yuan; the Gemini team: Xi Chen, Weicheng Kuo, and Paul Voigtlaender; the Robotics Safety team: Vikas Sindhwani and Abhishek Jindal; Product and Program support: Kendra Byrne and Sally Jesmonth; and members of the Developer Relations team: Paul Ruiz and Paige Bailey, for helping with this article.