At Google, we believe AI should be helpful for everyone. But it’s hard for AI to be inclusive when so many prominent large language models (LLMs) only understand a small fraction of the thousands of languages spoken around the world. This leads many models to unintentionally overlook the cultural and linguistic differences that make each society unique, limiting the immense benefits that LLMs can offer to potentially billions of people.

With Gemma, our family of lightweight and efficient open models, developers and researchers across the globe now have the tools to build LLMs that address these specific cultural differences. Leveraging the same research and technology used to create Gemini, Gemma efficiently understands text across languages, leading to improved multilingual performance, reduced costs, and greater flexibility for creating truly inclusive AI.

Teams like those at INSAIT and AI Singapore have already been empowered to create new possibilities using Gemma variants. INSAIT’s recent release of BgGPT, a state-of-the-art Bulgarian model based on gemma-2-27b, and AI Singapore’s SEA-LIONv3, a groundbreaking new model for Southeast Asian languages based on gemma-2-9b, show how, by blending cultural knowledge with AI expertise, both teams have created new LLMs that meet the unique needs of their communities.

SEA-LION: Building LLMs for diverse SEA communities

Recognizing that Southeast Asia’s (SEA) diverse languages and cultures were underrepresented in existing LLMs, AI Singapore developers created SEA-LION to better reflect the region’s nuances, contexts, and cultural diversity. This family of models has already had an immense impact on local SEA communities. For example, the latest SEA-LION model based on Gemma has become the foundation for Sahabat-AI, an Indonesian LLM built by GoTo to power the AI voice assistant in its GoPay and Gojek apps. This allows millions of Indonesians to use these apps more naturally in their local languages and dialects.

The biggest challenge in building a leading LLM for SEA languages was finding high-quality, diverse training data. This is why the team collaborated with Google DeepMind and Google Research on Project SEALD, an effort to enhance datasets that can be used to train, fine-tune, and evaluate LLMs in languages spoken across Southeast Asia. The team also had to ensure the data they used was relevant, which meant filtering out gambling content and ads that didn’t reflect the region’s true linguistic and cultural heritage. To address this, they built a working group of native speakers and linguists to ensure each model’s translation was accurate and felt natural for users of different backgrounds.
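
To make the filtering step above more concrete, here is a minimal Python sketch of keyword-based corpus filtering. It is an illustration only, not the pipeline the AI Singapore team used: the blocklist terms, the `is_relevant` and `filter_corpus` helpers, and the sample sentences are all hypothetical. A production system would more likely rely on curated per-language lists, trained classifiers, and review by native speakers.

```python
import re

# Hypothetical blocklist for illustration; real pipelines would use
# curated, per-language term lists or a trained classifier instead.
BLOCKED_TERMS = ("casino", "jackpot", "slot online", "buy now", "promo code")

# One case-insensitive pattern that matches any blocked term.
_BLOCKED_RE = re.compile(
    "|".join(re.escape(term) for term in BLOCKED_TERMS),
    re.IGNORECASE,
)


def is_relevant(text: str, max_hits: int = 0) -> bool:
    """Return True if `text` contains at most `max_hits` blocked terms."""
    return len(_BLOCKED_RE.findall(text)) <= max_hits


def filter_corpus(docs):
    """Keep only the documents that pass the relevance check."""
    return [doc for doc in docs if is_relevant(doc)]


# Made-up sample documents (not real training data).
sample = [
    "Selamat pagi, apa kabar hari ini?",
    "Main slot online dan menangkan JACKPOT besar!",
    "Use promo code SAVE10 and buy now.",
]
kept = filter_corpus(sample)
print(kept)
```

A simple blocklist like this is cheap to run over a large corpus, but it can miss obfuscated spam and wrongly discard legitimate text that mentions a blocked term, which is one reason human review by native speakers remains valuable.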