Project Aquarium: Mapping Language Data for a More Inclusive AI Future, From Southeast Asia to the World
AI is poised to transform the world, and to ensure this transformation benefits everyone, it must be inclusive of all languages and cultures. This is particularly crucial in Southeast Asia, which is home to over 1,300 languages, about 20% of the world’s total, yet whose wealth of local languages and deep cultural nuances are often overlooked by today’s global Large Language Models (LLMs).
AI Singapore (AISG) is dedicated to closing this critical gap, striving to ensure that the future of AI accurately reflects the rich diversity of languages and cultures of the communities it serves. This mission is bolstered by the continued support of our SEA-LION partners and collaborators. As announced at Google for Singapore, Google.org has provided US$1 million in funding to AISG to initiate Project Aquarium.
Project Aquarium is a first-of-its-kind regional initiative to build a shared platform for collecting, curating, and annotating high-quality language datasets that reflect the region’s linguistic and cultural diversity. The project supports a sustainable language data ecosystem by working towards fair recognition and compensation mechanisms for contributors and by conducting regional data mapping to identify existing resources and gaps. Rather than reinventing the wheel, Project Aquarium fosters collaboration across academia, industry, and communities to build on existing efforts. It also emphasizes cultural data, safety, and responsible practices to ensure alignment across global languages, especially those that remain underrepresented. The platform will initially strengthen resources for Southeast Asian regional languages spoken by more than 700 million people, including Burmese, Khmer, Thai, Malay, Indonesian, Filipino, Tamil, and Lao.
Guided by a forward-looking roadmap, we will progressively extend our capabilities to a far wider range of languages, including those that have long been underrepresented on the global stage.

Project Aquarium Data Platform Interface
Why This Matters
Modern AI depends heavily on data. Large language models learn patterns, context, and meaning from the datasets used to train them. When certain languages lack sufficient high-quality data, those languages, and the communities who speak them, become underrepresented in AI systems.
When AI fails to capture linguistic diversity, the impact extends beyond technology. It affects access to digital services, the accuracy of information systems, and the ability of communities to participate fully in an AI-driven world. Ensuring language inclusion in AI is therefore not just a regional issue; it is a global priority.
What Is Project Aquarium
Project Aquarium is a purpose-built platform designed to support the full lifecycle of language data development. It serves as a centralized infrastructure to collect, host, and maintain datasets that enable researchers and developers to build AI systems attuned to global linguistic and cultural realities, with a focus on underrepresented languages.
Through Project Aquarium, contributors will be able to:
- Collect language datasets from diverse sources
- Curate and structure linguistic data for AI training
- Annotate datasets to capture context and cultural nuance
- Host and share datasets through a dedicated platform
What’s Next
Project Aquarium represents a long-term investment in building a more inclusive global AI ecosystem. Future phases will focus on expanding language coverage, growing contributor communities, enhancing annotation tools, and deepening integration with AI model development.
We will continue expanding and strengthening collaborations with partners globally to ensure Project Aquarium remains relevant and useful for advancing more inclusive and representative AI. AI Singapore will tap into our Project SEALD community network and work with organisations including AI4Bharat (India), AI Forum (Cambodia), E-CAIR (Philippines), KORIKA (Indonesia), and VISTEC (Thailand) to support collaboration, participation, and shared progress.
Stay tuned for exciting updates!
A Call for Global Collaboration
Building inclusive AI cannot be done by a single organization or research team. It requires participation from the communities who speak these languages, as well as collaboration across academia, industry, and civil society.
AI Singapore invites:
- Researchers and academic institutions working on language technologies
- Developers and AI practitioners building multilingual applications
- Linguists and language experts with deep knowledge of local contexts
- Community and individual contributors who can help collect and annotate real-world language data
Your contributions, whether datasets, annotations, research insights, or community engagement, can help ensure that the next generation of AI understands the richness and diversity of human language. By strengthening global language data and empowering local contributors, we can ensure that the future of AI truly reflects our languages, cultures, and voices.
Join us in contributing to Project Aquarium and helping shape a more inclusive global AI ecosystem. To explore collaboration opportunities, please reach out to us at aquarium@aisingapore.org or visit our website to learn more about Project Aquarium and ongoing initiatives.
