Multilingual innovation in LLMs: How open models help unlock global communication

June 23, 2025
Glenn Cameron, Product Marketing Manager, AI Developer

We are thrilled to celebrate the incredible contributions of the community to the Unlock Global Communication with Gemma competition on Kaggle! Developers tackled a critical challenge in AI: adapting state-of-the-art large language models (LLMs) to diverse cultural and linguistic contexts.

Models often exhibit a bias towards high-resource languages because their training and evaluation datasets are dominated by those languages. This can lead to a performance gap, where the latest AI advancements may not reach lower-resourced languages. These models may also lack not only an understanding of a language, but also the culturally relevant context that would make them genuinely helpful to its speakers.

We were incredibly impressed by the community's creative solutions for translating languages, lyrics, historical texts, and more.


Honoring the innovators

Through hundreds of submissions, developers demonstrated how to bring the transformative power of LLMs to languages everywhere. Projects leveraged custom datasets and efficient post-training methods to adapt Gemma for instruction following, translation, and specific domains. We encourage you to explore the notebooks on Kaggle to see these techniques in action and apply them to your own multilingual projects.


Gemma 2 Swahili

The first-place project adapted Gemma for Swahili understanding, opening up new possibilities to reach the language's more than 200 million speakers. The 2B, 9B, and 27B Gemma models were fine-tuned using parameter-efficient fine-tuning (PEFT) techniques.
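To illustrate the kind of parameter-efficient approach described here, the sketch below applies LoRA adapters to a Gemma 2 model with Hugging Face's transformers and peft libraries. The model size, hyperparameters, and the swahili_instructions.jsonl data file are illustrative assumptions, not the winning notebook's actual configuration.

# Minimal LoRA fine-tuning sketch (illustrative settings, not the competition
# notebook's exact configuration).
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "google/gemma-2-2b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# LoRA trains small low-rank adapter matrices instead of updating all model weights.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
))

# Hypothetical Swahili instruction data with a single "text" column.
dataset = load_dataset("json", data_files="swahili_instructions.jsonl")["train"]
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    train_dataset=dataset,
    args=TrainingArguments(
        output_dir="gemma2-swahili-lora",
        per_device_train_batch_size=2,
        num_train_epochs=1,
        learning_rate=2e-4,
    ),
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()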

A key aspect of their tuning was Gemma’s “remarkable flexibility in instruction-response formatting,” which allowed the models to parse instructions with minimal structural constraints and generate coherent responses across different input formats.
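For reference, the short sketch below shows Gemma's standard turn markers being applied with apply_chat_template; the Swahili instruction is only an illustration, and the point of the winners' observation is that the fine-tuned models also coped with looser formats than this canonical one.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")

# An illustrative Swahili instruction, wrapped in the turn markers that the
# instruction-tuned Gemma models expect.
messages = [{"role": "user", "content": "Tafsiri kwa Kiswahili: Good morning, friends."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # shows the <start_of_turn>user ... <end_of_turn><start_of_turn>model structure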


Kyara: Retrieval Augmentation for LLM Fine-Tuning

Knowledge Yielding Adaptive Retrieval Augmentation (Kyara) explored retrieval processes for LLM fine-tuning, demonstrating how to enhance Gemma’s ability to generate informed responses in Traditional Chinese.

The project focused on building high-quality question & answer (Q&A) datasets using a graph-based approach to knowledge retrieval, inspired by how humans learn by connecting concepts.
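As a rough illustration of the graph-based idea (not Kyara's actual pipeline), the sketch below links passages that share a concept and treats each connected pair as the retrieval context for generating a Q&A item; the documents and concept sets are made up.

# Illustrative graph-based linking step: connect passages that share a concept,
# then use each connected pair as context for generating a Q&A item.
import networkx as nx

docs = {
    "d1": {"text": "台北101曾是世界最高的摩天大樓。", "concepts": {"台北101", "摩天大樓"}},
    "d2": {"text": "台北101位於台北市信義區。", "concepts": {"台北101", "台北市"}},
}

graph = nx.Graph()
graph.add_nodes_from(docs)
for a in docs:
    for b in docs:
        if a < b and docs[a]["concepts"] & docs[b]["concepts"]:
            graph.add_edge(a, b)

# The LLM call that turns each context into a question-answer pair is omitted here.
for a, b in graph.edges:
    context = docs[a]["text"] + "\n" + docs[b]["text"]
    print("Context for Q&A generation:\n" + context)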


ArGemma: Fine-Tuning Gemma for Arabic

The project fine-tuned Gemma for Arabic language tasks, including translation, summarization, storytelling, and dialogue generation.

Because Arabic has a rich literary history, the project also aimed to enhance comprehension of older forms of the language used in literary texts and art, employing multiple techniques to bridge tasks between Modern Standard Arabic and Classical Arabic.


Post-Training Gemma for Italian and beyond

This project focused on improving Italian language understanding for Gemma using a cost-effective post-training approach that addresses pitfalls such as hallucinations and catastrophic forgetting.

The 2B and 9B model sizes were fine-tuned on a mix of data, including a new instruction tuning dataset created using LLM-as-a-judge to ensure the quality of translations.
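The sketch below shows one common way an LLM-as-a-judge filter can be set up for translation pairs; the judge model, prompt wording, and score threshold are assumptions for illustration, not the project's actual setup.

# Illustrative LLM-as-a-judge filter for candidate translation pairs.
from transformers import pipeline

judge = pipeline("text-generation", model="google/gemma-2-9b-it", max_new_tokens=8)

JUDGE_PROMPT = (
    "Rate the Italian translation of the English sentence from 1 (poor) to 5 (perfect).\n"
    "Answer with a single digit.\n\nEnglish: {src}\nItalian: {tgt}\nScore:"
)

def keep_pair(src: str, tgt: str, threshold: int = 4) -> bool:
    """Keep a translation pair only if the judge scores it at or above the threshold."""
    reply = judge(JUDGE_PROMPT.format(src=src, tgt=tgt), return_full_text=False)[0]["generated_text"]
    digits = [c for c in reply if c.isdigit()]
    return bool(digits) and int(digits[0]) >= threshold

print(keep_pair("The weather is nice today.", "Oggi il tempo è bello."))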


Ancient Chinese Expert: Gemma 2 > ChatGPT

This project developed an “Ancient Chinese Expert” using Gemma to understand and generate translations for ancient Chinese texts, highlighting the potential of LLMs for historical cultural preservation.

The model was fine-tuned on a comprehensive dataset to improve linguistic understanding, and post-training included techniques to improve instruction following.


Lyric-Gemma 2: One Song, Different Stories

This project tackled nuanced challenges specific to AI-driven lyric translation, enhancing Gemma’s sensitivity to cultural references and symbolic language, while also ensuring rhythmic fidelity to the original song.

A multilingual dataset contained lyric translations annotated to capture crucial cultural context, emotional tone, and rhythmic features, enabling the model to grasp and replicate the artistic depth of lyrical content.
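An annotated record of this kind might look roughly like the following; the field names and the example line are illustrative assumptions, not the project's actual schema.

# Illustrative annotated lyric record (field names are assumed, not the project's schema).
lyric_example = {
    "source_lang": "zh",
    "target_lang": "en",
    "source_line": "月亮代表我的心",
    "reference_translation": "The moon represents my heart",
    "cultural_note": "The moon is a common symbol of longing and devotion in Chinese lyrics.",
    "emotional_tone": "tender, nostalgic",
    "syllables_source": 7,
    "syllables_target": 7,  # matched to the source line for rhythmic fidelity
}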


Fine-tuning Gemma 2 JPN for Yomigana

This project adapted Gemma 2 JPN to generate Yomigana (Furigana), a reading aid for Japanese text that assists language learners and readers encountering complex Kanji.

While rule-based tools for this already exist, LLMs can recognize rare Kanji better and “interpret the context of a sentence, enabling accurate disambiguation of polyphonic Kanji”. The notebook also noted that the model's conversational capabilities had degraded as a result of training on this single translation task.
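A minimal Yomigana prompt against Gemma 2 JPN might look like the sketch below; the prompt wording and the expected output format are assumptions rather than the notebook's exact approach.

# Illustrative Yomigana prompt for Gemma 2 JPN (prompt wording and output format
# are assumptions, not the notebook's exact setup).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-2b-jpn-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

prompt = "次の文のすべての漢字に、読みがなを括弧で付けてください。\n文: 今日は東京へ行きます。"
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
)
outputs = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
# Expected style of answer: 今日（きょう）は東京（とうきょう）へ行（い）きます。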


Mathematical Minds: Fine-tuning Gemma 2 for Hindi

This project enhanced Gemma’s mathematical and logical reasoning over Hindi numeric words, which are challenging for models to interpret because of complex word formations, for example “दो सौ” for “200” or “ढाई” for “2.5”.

The 9B model was fine-tuned on a curated, human-expert-verified dataset featuring a wide array of question types, unlocking uses for AI-driven educational tools, automated tutoring, and localized content.
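Training pairs of the kind described here might look roughly like the following made-up examples (not drawn from the verified dataset itself).

# Made-up examples showing the word formations such a dataset targets.
examples = [
    {
        "instruction": "दो सौ और ढाई का योग क्या है?",   # "What is the sum of 200 and 2.5?"
        "response": "दो सौ (200) + ढाई (2.5) = 202.5",
    },
    {
        "instruction": "डेढ़ सौ में से पचास घटाइए।",      # "Subtract 50 from 150."
        "response": "डेढ़ सौ (150) - पचास (50) = 100",
    },
]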


Gemma-2-9b-kk-it: Learning to translate Kazakh

This project fine-tuned the Gemma 2 9B model for translation tasks in Kazakh. Kazakh is written in three distinct scripts (Cyrillic, Latin, and Arabic), and its Cyrillic form requires approximately twice as many tokens as English, presenting a challenge for training with limited resources.
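The token-count gap can be checked directly with the Gemma tokenizer, as in the small sketch below; the sample sentences are arbitrary illustrations, and exact counts will vary with the text.

# Compare token counts for an arbitrary sentence pair with the Gemma tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b")

english = "The weather is very cold in winter."
kazakh_cyrillic = "Қыста ауа райы өте суық болады."

print(len(tokenizer(english)["input_ids"]), "tokens for the English sentence")
print(len(tokenizer(kazakh_cyrillic)["input_ids"]), "tokens for the Kazakh (Cyrillic) sentence")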

On benchmarks, the fine-tuned model outperformed the larger 27B Gemma variant and Google Translate, demonstrating how to adapt LLMs for underrepresented languages with a cost-effective approach.


THEODEN: The Old English Gemma

This project enables Gemma to understand and translate Old English, the earliest recorded form of the English language. A custom dataset with Old English-Modern English language pairs was created to help tackle the challenge of working with historical languages and limited publicly available data.

The notebook also features a bonus audio generation component, based on an open-source Icelandic text-to-speech model, offering an approximation of how speech might have sounded.


10 more awesome projects

  • Gemma PT: This project fine-tuned the ShieldGemma content classifier to detect prejudice and disinformation in Portuguese.


Looking ahead with Gemma 3

With over 7,000 languages spoken worldwide, the potential for AI to bridge communication gaps is immense. The Gemma open model family provides a powerful foundation for developers to adapt high-performing models to low-resource languages.

The innovation and dedication demonstrated by the Kaggle community in adapting Gemma 2 for various languages are truly inspiring. As we continue to build a future where AI empowers global communication for everyone, we're excited for Gemma 3, which brings pretrained support for over 140 languages, making it a great foundation to build on.

We encourage developers to explore the possibilities of Gemma, share their datasets and models with others, and continue advancing multilingual AI together.