The Internet of Things (IoT) space is changing rapidly as artificial intelligence finds its way into more and more devices. Thanks to advances in AI and cloud services, simple microcontrollers, paired with standard sensors and actuators, can be built into a wide variety of objects to create interactive, intelligent devices. In this post, we’ll explore how IoT developers can use the Gemini REST API to create devices that understand and react to custom speech commands, bridging the gap between the digital and physical worlds to solve practical and previously challenging problems.
To keep things simple, this post sticks to high-level concepts, but you can see the full code example and device schematic, built around the ESP32 microcontroller, on GitHub.
Traditionally, integrating speech recognition into IoT devices, especially those with limited memory, has been a complex task. Solutions like LiteRT for Microcontrollers let you run basic models that recognize keywords, but human language is a much broader and more nuanced input that developers can use to their advantage. The Gemini API simplifies this by providing a powerful, cloud-based solution that understands a wide range of spoken language, including multiple languages, from a single tool. It can also determine what actions an embedded device should take based on user input.
These capabilities rely on the Gemini API’s ability to process and interpret audio data from an IoT device, as well as determine the next step the device should take, following this process:
1. Audio capture: The IoT device, equipped with a microphone, captures a spoken sentence.
2. Audio encoding: Speech is encoded into a format suitable for internet transmission. In the official example mentioned above, we convert the analog signal to WAV-format audio, then to a base64-encoded string for the Gemini API.
3. API request: The encoded audio is sent to the Gemini API via a REST API call. This call includes instructions, such as requesting the text of the spoken command, or directing Gemini to select a predefined custom function (e.g., turning on lights). If using the Gemini API’s function calling feature, you must provide function definitions, including names, descriptions, and parameters, within your request JSON.
4. Processing: The Gemini API’s AI models analyze the encoded audio and determine the appropriate response.
5. Response: The API returns information to the IoT device, such as a transcript of the audio, the next function to call, or a text response with further instructions.
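Steps 2 and 3 above can be sketched in a few lines of Python. This is an illustration only, not the code from the official ESP32 example (which runs on the microcontroller itself): the helper names `pcm_to_wav_bytes`, `build_request_body`, and `send_request` are hypothetical, the 16 kHz mono 16-bit audio format is an assumption, and the model name `gemini-1.5-flash` is a placeholder you would replace with whichever model you use.

```python
import base64
import io
import json
import struct
import urllib.request
import wave


def pcm_to_wav_bytes(samples, sample_rate=16000):
    """Wrap signed 16-bit mono PCM samples in a WAV container (assumed format)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)            # single microphone channel
        wav.setsampwidth(2)            # 2 bytes = 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack(f"<{len(samples)}h", *samples))
    return buffer.getvalue()


def build_request_body(wav_bytes, prompt, function_declarations=None):
    """Build a request body with a text instruction and inline base64 audio."""
    body = {
        "contents": [{
            "parts": [
                {"text": prompt},
                {"inline_data": {
                    "mime_type": "audio/x-wav",
                    "data": base64.b64encode(wav_bytes).decode("ascii"),
                }},
            ]
        }]
    }
    # Function definitions are only attached when function calling is wanted.
    if function_declarations:
        body["tools"] = [{"function_declarations": function_declarations}]
    return body


def send_request(body, api_key, model="gemini-1.5-flash"):
    """POST the body to the generateContent endpoint (model name is a placeholder)."""
    url = ("https://generativelanguage.googleapis.com/v1beta/models/"
           f"{model}:generateContent?key={api_key}")
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

On a real device the samples would come from the microphone's analog-to-digital converter. The same structure applies in C++ on a microcontroller, where memory limits usually mean the audio is encoded and streamed in chunks instead of being held in a single buffer.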
For example, let’s consider controlling an LED with voice commands to turn it on or off and change its color. We can define two functions: one to toggle the LED and another to change its color. Instead of limiting the color to a preset range, we can allow any value from 0 to 255 for each of the red, green, and blue channels, offering over 16 million possible combinations (256 × 256 × 256 = 16,777,216).
The following request, including the base64-encoded audio string ($DATA), demonstrates this:
{
  "contents": [
    {
      "parts": [
        {
          "text": "Trigger a function based on this audio input."
        },
        {
          "inline_data": {
            "mime_type": "audio/x-wav",
            "data": "$DATA"
          }
        }
      ]
    }
  ],
  "tools": [
    {
      "function_declarations": [
        {
          "name": "changeColor",
          "description": "Change the default color for the lights in an RGB format. Example: Green would be 0 255 0",
          "parameters": {
            "type": "object",
            "properties": {
              "red": {
                "type": "integer",
                "description": "A value from 0 to 255 for the color RED in an RGB color code"
              },
              "green": {
                "type": "integer",
                "description": "A value from 0 to 255 for the color GREEN in an RGB color code"
              },
              "blue": {
                "type": "integer",
                "description": "A value from 0 to 255 for the color BLUE in an RGB color code"
              }
            },
            "required": [
              "red",
              "green",
              "blue"
            ]
          }
        },
        {
          "name": "toggleLights",
          "description": "Turn on or off the lights",
          "parameters": {
            "type": "object",
            "properties": {
              "toggle": {
                "type": "boolean",
                "description": "Determine if the lights should be turned on or off."
              }
            },
            "required": [
              "toggle"
            ]
          }
        }
      ]
    }
  ]
}
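Once the API responds, the device still has to turn the returned function call into an action. The following Python sketch shows one way to do that dispatch; it is not taken from the official example. The `Led` class and handler functions are hypothetical stand-ins for real hardware control, and `SAMPLE_RESPONSE` is a hand-written illustration of a response containing a `functionCall` part, not captured API output.

```python
class Led:
    """Hypothetical stand-in for the physical LED's state."""

    def __init__(self):
        self.on = False
        self.color = (255, 255, 255)


def clamp_channel(value):
    """Keep a color channel inside the 0-255 range the LED accepts."""
    return max(0, min(255, int(value)))


def change_color(led, red, green, blue):
    led.color = (clamp_channel(red), clamp_channel(green), clamp_channel(blue))


def toggle_lights(led, toggle):
    led.on = bool(toggle)


# Maps the function names declared in the request to local handlers.
HANDLERS = {"changeColor": change_color, "toggleLights": toggle_lights}

# Hand-written example of a response that selects changeColor (illustrative).
SAMPLE_RESPONSE = {
    "candidates": [{
        "content": {
            "parts": [{
                "functionCall": {
                    "name": "changeColor",
                    "args": {"red": 0, "green": 255, "blue": 0},
                }
            }]
        }
    }]
}


def dispatch(response, led):
    """Run each known function call in the first candidate; return their names."""
    handled = []
    candidates = response.get("candidates", [])
    if not candidates:
        return handled
    for part in candidates[0].get("content", {}).get("parts", []):
        call = part.get("functionCall")
        if call is None:
            continue  # plain text part, nothing to execute
        handler = HANDLERS.get(call.get("name"))
        if handler is None:
            continue  # ignore functions this device does not implement
        handler(led, **call.get("args", {}))
        handled.append(call["name"])
    return handled
```

Clamping the color channels on the device is a defensive choice: the function description asks for values from 0 to 255, but the device should not assume every returned argument respects that range.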
While this is a very simplified example, it does highlight numerous practical benefits for IoT development:
By combining function calling and audio input with the Gemini API, developers can create IoT devices that intelligently respond to spoken commands.
While audio and function calling are essential tools for enhancing IoT devices with AI, there’s so much more that can be used to create amazing and useful intelligent devices. Some of the potential areas for exploration include:
We’re excited to see all of the great things you build with the Gemini API! Your applications can transform the way we interact with the world around us and solve real-world problems with the power of AI. Please share your projects with us on Google AI for Developers on LinkedIn and Google AI Developers on X.