As an engineer, I’ve always been fascinated by languages—both the kind we code in and the kind we speak. Learning a new programming language typically begins by building something tangible, instantly putting theory into practice. Learning a new spoken language, on the other hand, often happens in a vacuum—through textbooks or exercises that feel strangely disconnected from the situations where language actually matters. As is the case with programming, language is best learned through meaningful contexts: the conversations we have, the objects around us, the moments we find ourselves in. Unlike traditional learning tools, AI can adapt to a learner’s context, making it uniquely suited to help us practice languages in ways that feel more natural and personal.
This led me, along with a small group of colleagues, to experiment with the Gemini API, which enables developers to access the latest generative models from Google. The result is Little Language Lessons: a collection of three bite-sized learning experiments, all powered by Google’s Gemini models.
One of the most frustrating parts about learning a language is finding yourself in a situation where you need a specific word or phrase—and it’s one that you haven’t learned yet.
That’s the idea behind Tiny Lesson. You describe a situation—maybe it’s “asking for directions” or “finding a lost passport”—and receive useful vocabulary, phrases, and grammar tips tailored to that context.
We were able to accomplish this using a simple prompt recipe. The prompt begins with a persona-setting preamble that looks like this:
You are a(n) {target language} tutor who is bilingual in {target language} and
{source language} and an expert at crafting educational content that is
custom-tailored to students' language usage goals.
In this prompt and in all of the prompts to come, we took advantage of Gemini’s ability to provide outputs as structured JSON, defining the desired result as a set of keys in an object:
For the given usage context, provide a JSON object containing two keys:
"vocabulary" and "phrases".
The value of "vocabulary" should be an array of objects, each containing three
keys: "term", “transliteration”, and "translation".
The value of "term" should be a {target language} word that is highly relevant
and useful in the given context.
If the language of interest is ordinarily written in the Latin script, the
value of "transliteration" should be an empty string. Otherwise, the value of
"transliteration" should be a transliteration of the term.
The value of "translation" should be the {source language} translation of
the term.
...
In total, each lesson is the result of two calls to the Gemini API: one prompt generates all of the vocabulary and phrases, and the other generates relevant grammar topics.
At the end of each prompt, we interpolate the user’s desired usage context as follows:
INPUT (usage context): {user input}
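To make this concrete, here’s a rough sketch of what one of these calls might look like using the Gemini API’s Python SDK. The helper name, model choice, and example inputs are illustrative rather than taken from our actual code:

```python
import json

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key


def build_vocab_prompt(source_lang: str, target_lang: str, usage_context: str) -> str:
    """Illustrative helper: persona preamble + JSON-structure instructions + user input."""
    return (
        f"You are a(n) {target_lang} tutor who is bilingual in {target_lang} and "
        f"{source_lang} and an expert at crafting educational content that is "
        "custom-tailored to students' language usage goals.\n\n"
        "For the given usage context, provide a JSON object containing two keys: "
        '"vocabulary" and "phrases".\n'
        "...\n"  # remaining structure instructions elided
        f"INPUT (usage context): {usage_context}"
    )


model = genai.GenerativeModel("gemini-1.5-flash")  # model name is an assumption

# Requesting JSON output so the response parses directly into a dict.
response = model.generate_content(
    build_vocab_prompt("English", "Japanese", "finding a lost passport"),
    generation_config=genai.GenerationConfig(response_mime_type="application/json"),
)
lesson = json.loads(response.text)
print(lesson["vocabulary"][0]["term"])
```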
There’s a moment in the journey of learning a language when you start feeling comfortable. You can hold conversations, express yourself, and mostly get by. But then you realize, you still sound… off. Too formal. Stiff.
We built Slang Hang to help address this. The idea is simple: generate a realistic conversation between native speakers and let users learn from it. You can watch the dialogue unfold, revealing one message at a time and unpacking unfamiliar terms as they appear.
The preamble for the Slang Hang prompt looks like this:
You are a screenwriter who is bilingual in {source language} and
{target language} and an expert at crafting captivating dialogues.
You are also a linguist and highly attuned to the cultural nuances that
shape natural speech.
Although users can only reveal messages one at a time, everything—the setting, the conversation, the explanations for highlighted terms—is generated from a single call to the Gemini API. We define the structure of the JSON output as follows:
Generate a short scene that contains two interlocutors speaking authentic
{target language}. Give the result as a JSON object that contains two keys:
"context" and "dialogue".
The value of "context" should be a short paragraph in {SOURCE LANGUAGE}
that describes the setting of the scene, what is happening, who the speakers
are, and speakers' relationship to each other.
The value of "dialogue" should be an array of objects, where each object
contains information about a single conversational turn. Each object in the
"dialogue" array should contain four keys: "speaker", "gender", "message",
and "notes".
...
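A minimal sketch of that single call, and of stepping through the returned dialogue one turn at a time, might look something like this (the model choice and the placeholder prompt string are illustrative):

```python
import json

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key
model = genai.GenerativeModel("gemini-1.5-flash")  # model name is an assumption

# Stands in for the preamble and JSON-structure instructions shown above,
# with the source and target languages interpolated.
slang_hang_prompt = "..."

# A single call returns the entire scene: setting, dialogue, and notes.
response = model.generate_content(
    slang_hang_prompt,
    generation_config=genai.GenerationConfig(response_mime_type="application/json"),
)
scene = json.loads(response.text)

print(scene["context"])          # the setting, shown up front
for turn in scene["dialogue"]:   # the UI reveals these one message at a time
    input("Reveal next message...")
    print(f'{turn["speaker"]}: {turn["message"]}')
    if turn["notes"]:
        print(f'  note: {turn["notes"]}')
```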
The dialogue is generated in the user’s target language, but users can also translate messages into their native language (a functionality powered by the Cloud Translation API).
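A translation like that can be a single call to the Cloud Translation API; here’s a rough sketch using the basic (v2) Python client, with an invented example message:

```python
from google.cloud import translate_v2 as translate

# Basic Cloud Translation client; credentials come from the
# GOOGLE_APPLICATION_CREDENTIALS environment variable.
client = translate.Client()

# Translate one revealed message back into the user's source language.
result = client.translate(
    "¿Qué onda? ¿Todo bien?",   # example message, not from the experiment
    source_language="es",
    target_language="en",
)
print(result["translatedText"])
```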
One of the more interesting aspects of this experiment is the element of emergent storytelling. Each scene is unique and generated on the fly—it could be a street vendor chatting with a customer, two coworkers meeting on the subway, or even a pair of long-lost friends unexpectedly reuniting at an exotic pet show.
That said, we found that this experiment is somewhat susceptible to accuracy errors: it occasionally misuses certain expressions and slang, or even makes them up. LLMs still aren’t perfect, and for that reason it’s important to cross-reference with reliable sources.
Sometimes, you just need words for the things in front of you. It can be extremely humbling to realize just how much you don’t know how to say in your target language. You know the word for “window”, but how do you say “windowsill”? Or “blinds”?
Word Cam turns your camera into an instant vocabulary helper. Snap a photo, and Gemini will detect objects, label them in your target language, and give you additional words that you can use to describe them.
This experiment leverages Gemini’s vision capabilities for object detection. We send the model an image and ask it for the bounding box coordinates of the different objects in that image:
Provide insights about the objects that are present in the given image.
Give the result as a JSON object that contains a single key called "objects".
The value of "objects" should be an array of objects whose length is no more
than the number of distinct objects present in the image. Each object in the
array should contain four keys: "name", "transliteration", "translation", and
"coordinates".
...
The value of "coordinates" should be an integer array representing the
coordinates of the bounding box for the object. Give the coordinates as [ymin,
xmin, ymax, xmax].
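Here’s a rough sketch of how that request might look with the Python SDK. The model name and image are placeholders, and the assumption that coordinates come back normalized to a 0-1000 range (and need rescaling to pixels) is ours, not something stated in the prompt:

```python
import json

from PIL import Image
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key
model = genai.GenerativeModel("gemini-1.5-flash")  # model name is an assumption

photo = Image.open("living_room.jpg")   # illustrative input image
detection_prompt = "..."                # the object-detection prompt shown above

# The Python SDK accepts a list of parts mixing text and PIL images.
response = model.generate_content(
    [detection_prompt, photo],
    generation_config=genai.GenerationConfig(response_mime_type="application/json"),
)
objects = json.loads(response.text)["objects"]

# Crop the first detected object, assuming 0-1000 normalized coordinates
# that need to be rescaled to pixel values.
ymin, xmin, ymax, xmax = objects[0]["coordinates"]
width, height = photo.size
crop = photo.crop((xmin * width // 1000, ymin * height // 1000,
                   xmax * width // 1000, ymax * height // 1000))
crop.save("selected_object.png")
```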
Once the user selects an object, we send the cropped image to Gemini in a separate prompt and ask it to generate descriptors for that object in the user’s target language:
For the object represented in the given image, provide descriptors
that describe the object. Give the result as a JSON object that contains
a single key called "descriptors".
The value of "descriptors" should be an array of objects, where each
object contains five keys: "descriptor", "transliteration", "translation",
"exampleSentence", "exampleSentenceTransliteration", and
"exampleSentenceTranslation".
...
Across all three experiments, we also integrated text-to-speech functionality, allowing users to hear pronunciations in their target language. We did this using the Cloud Text-to-Speech API, which offers natural-sounding voices for widely spoken languages but limited options for less common ones. Regional accents aren’t well-represented, and thus there are sometimes mismatches between the user’s selected dialect and the accent of the playback.
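Synthesizing a single pronunciation is one call to the Text-to-Speech client; a minimal sketch in Python, with the term, language code, and filename as placeholders:

```python
from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()

# Synthesize a pronunciation for a single term in the target language.
response = client.synthesize_speech(
    input=texttospeech.SynthesisInput(text="Fensterbrett"),
    voice=texttospeech.VoiceSelectionParams(language_code="de-DE"),
    audio_config=texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3
    ),
)
with open("pronunciation.mp3", "wb") as out:
    out.write(response.audio_content)
```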
Although Little Language Lessons is just an early exploration, experiments like these hint at exciting possibilities for the future. This work also raises a few important questions: what might it look like to collaborate with linguists and educators to refine the approaches we investigate in Little Language Lessons? More broadly, how can AI make independent learning more dynamic and personalized?
For now, we’re continuing to explore, iterate, and ask questions. If you’d like to check out more experiments like this one, head over to labs.google.com.