The Gemini API and the Internet of Things

MAR 31, 2025
Paul Ruiz Senior Developer Relations Engineer

The Internet of Things (IoT) space is changing rapidly as artificial intelligence finds its way into more and more devices. Thanks to advances in AI and cloud services, simple microcontrollers, along with standard sensors and actuators, can be integrated into a variety of things to create interactive, intelligent devices. In this post, we’ll explore how IoT developers can use the Gemini REST API to create devices that both understand and react to custom speech commands, bridging the gap between the digital and physical worlds to solve practical and previously challenging problems.

To keep things simple, this post will stick to high-level concepts, but you can see the full code example and device schematic for the ESP32 microcontroller on GitHub.


From Voice to Action: The Power of Speech Recognition and Custom Functions

Traditionally, integrating speech recognition into IoT devices, especially those with limited memory, has been a complex task. While solutions like LiteRT for Microcontrollers enable you to run basic models to recognize keywords, human language is a much broader and more nuanced input that developers can use to their advantage. The Gemini API simplifies this by providing a powerful, cloud-based solution that understands a wide range of spoken language, even across different languages, all from a single tool. It can also determine what actions an embedded device should take based on user input.

These capabilities rely on the Gemini API’s ability to process and interpret audio data from an IoT device, as well as determine the next step the device should take, following this process:

1. Audio capture: The IoT device, equipped with a microphone, captures a spoken sentence.

2. Audio encoding: Speech is encoded into a format suitable for internet transmission. In the official example mentioned above, we convert analog signals to WAV-format audio, then to a base64-encoded string for the Gemini API.

3. API request: The encoded audio is sent to the Gemini API via a REST API call. This call includes instructions, such as requesting the text of the spoken command, or directing Gemini to select a predefined custom function (e.g., turning on lights). If using the Gemini API’s function calling feature, you must provide function definitions, including names, descriptions, and parameters, within your request JSON.

4. Processing: The Gemini API’s AI models analyze the encoded audio and determine the appropriate response.

5. Response: The API returns information to the IoT device, such as a transcript of the audio, the next function to call, or a text response with further instructions.
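
As a rough illustration of step 2, here is a minimal Python sketch of the encoding stage. It is not the ESP32 code from the official example: it assumes the microphone readings are already available as signed 16-bit mono samples, and the helper names (pcm_to_wav_bytes, encode_for_request), the 16 kHz sample rate, and the generated test tone are all hypothetical stand-ins.

```python
import base64
import io
import math
import struct
import wave


def pcm_to_wav_bytes(samples, sample_rate=16000):
    """Wrap signed 16-bit mono PCM samples in a WAV container (hypothetical helper)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)            # mono microphone
        wav_file.setsampwidth(2)            # 2 bytes per sample = 16-bit audio
        wav_file.setframerate(sample_rate)  # assumed rate, not taken from the post
        # Pack the samples as little-endian signed 16-bit integers.
        wav_file.writeframes(struct.pack(f"<{len(samples)}h", *samples))
    return buffer.getvalue()


def encode_for_request(wav_bytes):
    """Base64-encode WAV bytes into an ASCII string for the JSON request body."""
    return base64.b64encode(wav_bytes).decode("ascii")


# Stand-in for microphone input: 0.1 seconds of a 440 Hz tone.
SAMPLE_RATE = 16000
samples = [
    int(12000 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE))
    for n in range(SAMPLE_RATE // 10)
]

wav_bytes = pcm_to_wav_bytes(samples, SAMPLE_RATE)
audio_b64 = encode_for_request(wav_bytes)
```

The resulting audio_b64 string is what would be placed in the request's inline audio data. Keep in mind that base64 inflates the payload by roughly a third, which matters on memory-constrained devices.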


For example, let’s consider controlling an LED with voice commands to turn it on or off and change its color. We can define two functions: one to toggle the LED and another to change its color. Instead of limiting the color to a preset list, we can allow any value from 0 to 255 for each of the red, green, and blue channels, offering over 16 million possible combinations (256 × 256 × 256 = 16,777,216).

The following request body, with $DATA standing in for the base64-encoded audio string, demonstrates this:

{
    "contents": [
        {
            "parts": [
                {
                    "text": "Trigger a function based on this audio input."
                },
                {
                    "inline_data": {
                        "mime_type": "audio/x-wav",
                        "data": "$DATA"
                    }
                }
            ]
        }
    ],
    "tools": [
        {
            "function_declarations": [
                {
                    "name": "changeColor",
                    "description": "Change the default color for the lights in an RGB format. Example: Green would be 0 255 0",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "red": {
                                "type": "integer",
                                "description": "A value from 0 to 255 for the color RED in an RGB color code"
                            },
                            "green": {
                                "type": "integer",
                                "description": "A value from 0 to 255 for the color GREEN in an RGB color code"
                            },
                            "blue": {
                                "type": "integer",
                                "description": "A value from 0 to 255 for the color BLUE in an RGB color code"
                            }
                        },
                        "required": [
                            "red",
                            "green",
                            "blue"
                        ]
                    }
                },
                {
                    "name": "toggleLights",
                    "description": "Turn on or off the lights",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "toggle": {
                                "type": "boolean",
                                "description": "Determine if the lights should be turned on or off."
                            }
                        },
                        "required": [
                            "toggle"
                        ]
                    }
                }
            ]
        }
    ]
}
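
To show how a body like this might be assembled and sent, here is a hedged Python sketch using only the standard library. The model name (gemini-2.0-flash), the helper names (build_request_body, send_request), and passing the API key as a query parameter are assumptions made for illustration; the official ESP32 example on GitHub remains the reference for how the device actually does it. Only the toggleLights declaration is repeated here to keep the sketch short.

```python
import json
import urllib.request

# Assumption: a placeholder model name; substitute a model that accepts audio input.
MODEL = "gemini-2.0-flash"
ENDPOINT = (
    "https://generativelanguage.googleapis.com/v1beta/models/"
    f"{MODEL}:generateContent"
)

# One of the two function declarations from the request above.
TOGGLE_LIGHTS = {
    "name": "toggleLights",
    "description": "Turn on or off the lights",
    "parameters": {
        "type": "object",
        "properties": {
            "toggle": {
                "type": "boolean",
                "description": "Determine if the lights should be turned on or off.",
            }
        },
        "required": ["toggle"],
    },
}


def build_request_body(audio_b64, function_declarations):
    """Assemble the JSON body: a text instruction, inline audio, and the tools."""
    return {
        "contents": [
            {
                "parts": [
                    {"text": "Trigger a function based on this audio input."},
                    {
                        "inline_data": {
                            "mime_type": "audio/x-wav",
                            "data": audio_b64,
                        }
                    },
                ]
            }
        ],
        "tools": [{"function_declarations": function_declarations}],
    }


def send_request(body, api_key):
    """POST the body to the REST endpoint and return the parsed JSON response."""
    request = urllib.request.Request(
        f"{ENDPOINT}?key={api_key}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

With a valid key, calling send_request(build_request_body(audio_b64, [TOGGLE_LIGHTS]), api_key) would return the response as a Python dictionary. Building the body separately from sending it makes the payload easy to inspect or log before it leaves the device.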

Although simplified, this example highlights several practical benefits for IoT development:

  • Enhanced user experience: Developers can easily support voice input, providing a more intuitive and natural interaction, even for low-memory devices.

  • Simplified command handling: This setup eliminates the need for complex parsing logic, such as breaking down each spoken command, and for cumbersome manual inputs to pick the next function to run.

  • Dynamic function execution: Gemini selects the appropriate action based on user intent, making devices more dynamic and capable of complex operations.

  • Contextual understanding: While older speech recognition approaches needed rigidly structured commands such as “turn on the lights” or “set the brightness to 70%”, the Gemini API can understand more general statements, such as “it’s dark in here!”, “give me some reading light”, or “make it dark and spooky in here”, and choose an appropriate action even though the user never named one.

By combining function calling and audio input with the Gemini API, developers can create IoT devices that intelligently respond to spoken commands.
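
On the device side, acting on the model's choice can come down to a small lookup table. The Python sketch below assumes the REST response carries the selected function as a functionCall part with name and args fields; the handler bodies, the dispatch helper, and the hand-written SAMPLE_RESPONSE are hypothetical illustrations rather than output captured from the API.

```python
def clamp_channel(value):
    """Keep a color channel inside the 0-255 range the declaration describes."""
    return max(0, min(255, int(value)))


def change_color(red, green, blue):
    """Placeholder handler; a real device would drive the LED hardware here."""
    rgb = (clamp_channel(red), clamp_channel(green), clamp_channel(blue))
    return f"color set to {rgb}"


def toggle_lights(toggle):
    """Placeholder handler for switching the LED on or off."""
    return "lights on" if toggle else "lights off"


# Map the declared function names to the local handlers.
HANDLERS = {
    "changeColor": change_color,
    "toggleLights": toggle_lights,
}


def dispatch(response):
    """Run the handler for every function call found in the response."""
    results = []
    for candidate in response.get("candidates", []):
        for part in candidate.get("content", {}).get("parts", []):
            call = part.get("functionCall")
            if call is None:
                continue  # a plain text part, nothing to execute
            handler = HANDLERS.get(call.get("name"))
            if handler is None:
                continue  # ignore names that were never declared
            results.append(handler(**call.get("args", {})))
    return results


# Hand-written example of the assumed response shape, not real API output.
SAMPLE_RESPONSE = {
    "candidates": [
        {
            "content": {
                "role": "model",
                "parts": [
                    {
                        "functionCall": {
                            "name": "changeColor",
                            "args": {"red": 255, "green": 0, "blue": 0},
                        }
                    }
                ],
            }
        }
    ]
}

results = dispatch(SAMPLE_RESPONSE)
```

Because the model only returns a function name and arguments, the device keeps full control over what actually runs: names outside the lookup table are skipped, and the channel values are clamped before use.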


Turning Ideas into Reality

While audio input and function calling are essential tools for enhancing IoT devices with AI, there is much more to explore when building useful, intelligent devices. Some potential areas include:

  • Smart home automation: Control lights, appliances, and other devices with voice commands, improving convenience and accessibility.

  • Robotics: Issue spoken commands to robots or send streams of images or video to the Gemini API for navigation, task execution, and interaction, automating repetitive tasks and providing assistance in various settings.

  • Industrial IoT: Enhance specialized machinery and equipment to increase productivity and reduce risk for the people who rely on them.


Next Steps

We’re excited to see all of the great things you build with the Gemini API! Your applications can transform the way we interact with the world around us and solve real-world problems with the power of AI. Please share your projects with us on Google AI for Developers on LinkedIn and Google AI Developers on X.