Introducing Aquarium: An Open Data Platform for Southeast Asian Languages

Southeast Asia is one of the most linguistically diverse regions in the world, with over 650 million people speaking hundreds of languages and dialects. Yet many Southeast Asian languages are still missing or underrepresented in the data used to train today’s most powerful AI models. That is a problem – AI that doesn’t understand us can’t serve us. This means missed opportunities for innovation, inclusion, and impact in our communities.

That is why AI Singapore, in partnership with Google, is proud to introduce Aquarium – a new, open-source platform built to accelerate the discovery, sharing, and use of high-quality language datasets for Southeast Asia. In this initial phase, all users are invited to participate in the beta launch of Aquarium (here) for testing and feedback.

What is Aquarium?

Aquarium is a community-driven data platform that makes it easier for anyone – from researchers and developers to policymakers and grassroots organisations – to contribute, browse, and collaborate on language datasets across the region.

Try it out today to:

🔍 Browse datasets by country, language, and domain – from education to health, governance, and beyond
📥 Contribute your own datasets using guided templates and metadata tools
📊 Explore regional dashboards to understand where the gaps are – and where contributions are most needed
🤝 Collaborate across borders to build stronger, more inclusive AI for Southeast Asia

At its core, Aquarium is about making high-quality language data more open, inclusive, and representative, so that more AI models can better reflect the region’s voices, cultures, and needs.

What problem does Aquarium solve?

Global AI systems today are often trained on data that heavily favours English and a few dominant languages. As a result, Southeast Asian languages are frequently overlooked, leading to models that:

Misinterpret or mistranslate local expressions and context
Exclude smaller or indigenous languages entirely
Perform poorly in real-world, multilingual Southeast Asian settings

Aquarium helps close this gap by making high-quality regional language data more accessible and enabling the community to contribute directly to improving AI for Southeast Asia. It supports efforts to localise AI, promote linguistic diversity, and ensure no language or speaker is left behind.

What’s next?

The launch of the Aquarium platform at the 4th Languages Summit, co-hosted by Google and AI Singapore (AISG), is just the beginning.

In the coming months, we will continue to grow the platform in close collaboration with partners across the region, including governments, universities, nonprofits, and industry leaders, with the aim of making Aquarium not only useful, but inclusive, community-driven, and impactful.

Here’s what is coming up next:

🔤 More supported languages and domains
🏅 Contributor recognition and incentive systems
📈 Tools to benchmark model performance using contributed datasets
💬 Community discussion forums to foster collaboration and knowledge exchange

🤝 Join us today!

Aquarium is part of Project SEALD, and is built with a single vision: to ensure AI systems reflect and serve the rich linguistic diversity of Southeast Asia.

Here’s how you can get involved:

📥 Contribute language datasets

📊 Browse regional dashboards to see where data is missing – and where your contribution could make a big difference.

🌐 Find out more about Project SEALD here

📧 Reach out to collaborate: seald@aisingapore.org

Together, let us build AI that understands, speaks to, and speaks for Southeast Asia.
