About

Southeast Asian Languages in One Network (SEA-LION) is a family of open-source Large Language Models (LLMs) built to better understand Southeast Asia’s (SEA) diverse contexts, languages, and cultures.

This project is anchored by the Products Pillar of AI Singapore and is part of Singapore’s National Multi-Modal LLM Project (NMLP).

Our work on SEA-LION aims to create LLMs that cater to under-represented population groups and low-resource languages in the SEA region.

Existing LLMs often display strong biases in cultural values, political beliefs, and social attitudes. This stems from their training data, especially data scraped from the Internet, which is heavily skewed toward content from Western, Educated, Industrialized, Rich, and Democratic (WEIRD) societies. People from non-WEIRD societies are less likely to be literate, to use the Internet, or to have their output easily accessible online, leading to a significant imbalance in the data.

SEA-LION addresses this by being trained on a greater volume of content produced in Southeast Asian languages such as Thai, Vietnamese, and Bahasa Indonesia. This improves data representation and alignment for the region compared with models built primarily on Western or Chinese data. SEA-LION models understand the nuances of SEA languages and demonstrate a greater awareness of the region’s cultural context.

This lowers the barrier to adoption for governments, enterprises, academia, and end-users, while expanding the representation of Southeast Asian languages and cultures in mainstream LLMs, which are currently dominated by models trained on little data from, about, or by Southeast Asia.

Large Language Models are a type of artificial intelligence designed to understand and generate human language. They are trained on vast amounts of text data and can perform a wide range of tasks, such as translation, summarization, answering questions, and even writing code.
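
As a concrete illustration, the sketch below shows one way to query a SEA-LION model through the Hugging Face transformers library. The model ID and generation settings here are assumptions for illustration only; check AI Singapore's organisation page on the Hugging Face Hub for the current releases.

```python
# Minimal sketch: asking a SEA-LION model a question in Bahasa Indonesia.
# The model ID below is an assumption for illustration; see the
# aisingapore organisation on the Hugging Face Hub for current releases.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "aisingapore/sea-lion-7b-instruct"  # assumed model ID

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# "Apa ibu kota Vietnam?" = "What is the capital of Vietnam?"
prompt = "Apa ibu kota Vietnam?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```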

The SEA-LION project is continually evolving, with each new version marking a step in its roadmap toward a more inclusive and representative AI ecosystem for Southeast Asia.

Q2

  • First set of reasoning models trained on Southeast Asian data

Q3

  • SEA-LION v4 (Gemma 27B, Qwen 32B)
  • Continued pre-training (CPT) on both Gemma 3 and Qwen 3
  • First multimodal release, multiple quantized versions
  • #1 among open models (<200B) for SEA

Q4

  • Multimodality improvements.

Q3

  • Improved performance on SEA tasks while maintaining credible performance on standard English benchmarks.
  • Enhanced conversational abilities across SEA languages.
  • More contextually appropriate responses.

Q4

  • Outperforms similar-sized open-source models.
  • Surpasses some larger models in both general and SEA capabilities.
  • Outperforms most models on SEA-HELM benchmarks.