Bridging the Semantic Gap: Announcing the SEA-LION Embedding Suite

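Today, we are announcing the SEA-LION embedding suite, a family of embedding models built for Southeast Asian languages. Our strongest model achieves the highest SEA-BED score among the models we evaluated.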

Proven on SEA-BED

To ensure these models actually perform where it matters, we evaluated them on SEA-BED (Southeast Asia Embedding Benchmark). Unlike global benchmarks that rely on machine translation, SEA-BED uses human-curated data across 10 regional languages, including Tetum, a highly underserved language in our region. At the same time, we kept English and Chinese performance as close to SOTA as possible. Our models consistently show top-tier performance across:

  • Retrieval & Reranking
  • Semantic Textual Similarity (STS)
  • Bitext Mining
  • Instructional Retrieval

Performance Comparison

[Figure: Overview of the landscape of embedding models for Southeast Asian languages. Radar chart of SEA-BED performance across Vietnamese, Filipino, Tamil, Thai, Malay, Indonesian, Lao, Khmer, and Burmese, with colour-coded lines.]

[Figure: Comparing our SEA-LION embedding models against other top-of-class embedding models for Southeast Asian languages. Radar chart of SEA-BED performance across Vietnamese, Filipino, Tamil, Malay, Indonesian, Lao, Khmer, Burmese, Thai, and Tetum, with two model embeddings shown in different colours.]
| Model | Organisation | Size | SEA-BED (SEA) | MTEB (EN) | CMTEB (ZH) | Document Throughput (doc/s) | Time per Document (ms) |
|---|---|---|---|---|---|---|---|
| SEA-LION-E5-Embedding-600M | AISG | 0.6B | 80.03 | 61.41 | 60.79 | 79.77 | 12.54 |
| E5-large | Microsoft | 0.6B | 78.93 | 61.21 | 60.51 | 81.45 | 12.28 |
| SEA-LION-ModernBERT-Embedding-600M | AISG | 0.6B | 78.45 | 60.64 | 60.47 | 91.16 | 10.97 |
| Qwen-8B-Embedding | Alibaba | 8B | 77.26 | 68.71 | 75.00 | 3.20 | 312.19 |
| BGE-M3 | BAAI | 0.6B | 76.46 | - | - | 86.25 | 11.59 |
| SEA-LION-ModernBERT-Embedding-300M | AISG | 0.3B | 76.00 | 58.21 | 58.20 | 127.06 | 7.87 |
| sentence-transformers/LaBSE | Google | 0.5B | 74.99 | 49.45 | - | 108.40 | 9.22 |
| Embedding Gemma | Google | 0.3B | 70.44 | 65.11 | - | 32.86 | 30.43 |
| sentence-transformers/paraphrase-multilingual-mpnet-base-v2 | Microsoft | 0.3B | 65.40 | 55.71 | - | 59.10 | 16.92 |
| Qwen-0.6B-Embedding | Alibaba | 0.6B | 60.66 | 64.72 | 67.45 | 13.03 | 76.75 |
| sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 | Microsoft | 0.1B | 54.39 | 53.99 | - | 57.36 | 17.43 |
| text-embedding-3-small | OpenAI | - | 52.89 | - | - | could not process | could not process |

Notes:

  • Throughput: Calculations are based on an inference environment utilising 8x NVIDIA H200 GPUs.
  • Time per Document: Measured in milliseconds using the official translations of the ASEAN Charter.
  • Speed Test Configuration: Documents were processed in chunks, with each chunk size set to the model’s maximum context length minus 2.
  • Averaging: Both Throughput and Time per Document are averaged over 100 independent runs to reduce run-to-run variance.
  • Benchmark Scoring: All SEA-BED scores (including those of reference models) are calculated in-house to ensure direct comparability. For MTEB and CMTEB, scores for SEA-LION models are calculated in-house, while scores for other reference models are taken from the official MTEB/CMTEB leaderboards without further validation. A missing entry (blank or "-") indicates that the reported results are either incomplete or not available on the respective leaderboards.
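
The two speed columns are two views of the same measurement: time per document in milliseconds is roughly 1000 divided by throughput in doc/s (for example, 1000 / 79.77 ≈ 12.54). The snippet below is a minimal sketch of how such a measurement could be set up, not our actual benchmarking harness. The function names, the `fake_encode` stand-in, and the reading of "minus 2" as room for two special tokens are all assumptions made for illustration.

```python
import time
from statistics import mean


def chunk_tokens(token_ids, max_context_len):
    """Split a token sequence into chunks of at most max_context_len - 2 tokens."""
    # Assumption: the 2 reserved positions hold special tokens (e.g. start/end).
    size = max_context_len - 2
    return [token_ids[i:i + size] for i in range(0, len(token_ids), size)]


def benchmark(encode, documents, max_context_len, runs=100):
    """Return (docs per second, milliseconds per document), averaged over `runs`."""
    throughputs = []
    latencies_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        for doc in documents:
            for chunk in chunk_tokens(doc, max_context_len):
                encode(chunk)
        elapsed = time.perf_counter() - start
        throughputs.append(len(documents) / elapsed)
        latencies_ms.append(1000.0 * elapsed / len(documents))
    return mean(throughputs), mean(latencies_ms)


def fake_encode(chunk):
    """Hypothetical stand-in for a real model forward pass."""
    return [float(len(chunk))]


# Two toy "documents", already converted to token ids.
docs = [list(range(1000)), list(range(300))]
docs_per_s, ms_per_doc = benchmark(fake_encode, docs, max_context_len=512, runs=5)
print(f"{docs_per_s:.1f} doc/s, {ms_per_doc:.4f} ms/doc")
```

To time a real model, `fake_encode` would be replaced by the model's own encoding call, and the documents by tokenised text such as the ASEAN Charter translations.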

Try It Yourself: SEA-LION Embedding Demo

To help you get started, we have provided a practical demo showcasing how to integrate these embedding models into your RAG workflows. The demo includes source code for document indexing, semantic search, and performance benchmarking across various SEA languages.

You can access the full source code and documentation at the sealion-embedding-demo GitHub repository.
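
For readers who want a feel for the indexing and semantic search steps before opening the repository, here is a minimal, self-contained sketch. It is not the demo's code. `VectorIndex` and `toy_embed` are hypothetical names, and `toy_embed` is a keyword-counting stand-in so the example runs without downloading a model. In practice it would be replaced by a call to one of the embedding models.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


class VectorIndex:
    """Tiny in-memory index: stores unit vectors and ranks by cosine similarity."""

    def __init__(self, embed):
        self.embed = embed  # any callable mapping text -> list of floats
        self.texts = []
        self.vectors = []

    def add(self, text):
        self.texts.append(text)
        self.vectors.append(normalize(self.embed(text)))

    def search(self, query, top_k=3):
        q = normalize(self.embed(query))
        scored = [
            (sum(a * b for a, b in zip(q, v)), text)
            for v, text in zip(self.vectors, self.texts)
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]


# Hypothetical stand-in for a real embedding model: counts three Indonesian keywords.
KEYWORDS = ["visa", "pengembalian", "pengiriman"]


def toy_embed(text):
    words = text.lower().split()
    # The small constant keeps the vector non-zero when no keyword matches.
    return [float(words.count(k)) + 0.01 for k in KEYWORDS]


index = VectorIndex(toy_embed)
for doc in [
    "Cara memperpanjang visa kunjungan",
    "Kebijakan pengembalian dana",
    "Estimasi waktu pengiriman paket",
]:
    index.add(doc)

results = index.search("berapa lama pengiriman ke Jakarta", top_k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

Normalising vectors once at indexing time keeps each query to a single pass of dot products. A production RAG system would typically swap the Python lists for a vector database.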

Why Embeddings Matter for the Region

While Large Language Models (LLMs) grab the headlines, embeddings are the silent engine behind modern AI. They are critical for current LLM systems, specifically for memory, search, and retrieval (RAG). Whether you are building a chatbot for a government portal in Bangkok, a customer support bot for an e-commerce platform in Jakarta, or a legal document analyser in Hanoi, high-quality embeddings are the difference between “getting the gist” and “truly understanding.”

This release includes five specialised models and four foundational checkpoints across three distinct architectures, all unified by one goal: providing the best vector representations for Southeast Asia’s unique linguistic landscape.

We are releasing these models under the MIT License to support developers, researchers, and startups across the ASEAN region.

The Lineup: Precision for Every Use Case

For those who require the highest possible accuracy for embedding tasks such as retrieval-augmented generation (RAG), clustering, semantic textual similarity (STS), or reranking, we recommend our SEA-LION embedding suite. The suite is designed to balance state-of-the-art accuracy, speed, and memory efficiency.

1. The Powerhouse: SEA-LION-E5-Embedding-600M

We tested a large pool of models and selected E5-Large for fine-tuning on our curated SEA data. The resulting model achieves SOTA performance on SEA-BED. It is a drop-in replacement for the embedding model in existing agentic workflows, specifically optimised to understand the nuances in languages like Thai, Vietnamese, and Indonesian.

2. Efficient Embedding: SEA-LION-ModernBERT-Embedding (300M & 600M)

We trained the SEA-LION-ModernBERT-Embedding (300M & 600M) flagship encoders from scratch using the ModernBERT architecture and the Gemma 3 tokenizer. They are designed for maximum performance in a small footprint, providing dense, high-quality embeddings while remaining the fastest and most capable options for long-document RAG in the region.

Further fine-tuned specifically for multilingual embedding tasks, the SEA-LION-ModernBERT-Embedding 300M and 600M models feature native 8,192-token context windows and alternating attention mechanisms to handle complex semantic tasks over long inputs.
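
To make "alternating attention" concrete, the sketch below builds toy attention masks for a stack of layers. It assumes the pattern from the original ModernBERT design, where some layers use full (global) attention and the rest use sliding-window (local) attention. The window size of 4 and the every-third-layer schedule are illustrative values only, not confirmed settings of the SEA-LION models.

```python
def layer_mask(seq_len, layer_idx, window=4, global_every=3):
    """Return mask[i][j]: may token i attend to token j in this layer?"""
    if layer_idx % global_every == 0:
        # Global layer: every token attends to every token.
        return [[True] * seq_len for _ in range(seq_len)]
    half = window // 2
    # Local layer: token i only attends to tokens within `half` positions.
    return [[abs(i - j) <= half for j in range(seq_len)] for i in range(seq_len)]


def attended_pairs(mask):
    """Count allowed (query, key) pairs, a rough proxy for attention cost."""
    return sum(sum(row) for row in mask)


for layer in range(4):
    kind = "global" if layer % 3 == 0 else "local"
    print(layer, kind, attended_pairs(layer_mask(8, layer)))
```

Global layers cost grows with the square of the sequence length, while local layers grow roughly linearly. This is why alternating the two helps with long context windows.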

3. More Checkpoints for SEA Developers: SEA-LION-ModernBERT (300M & 600M)

We are releasing the raw pre-trained checkpoints (300M and 600M), built from scratch, for the community to build upon. These checkpoints represent the culmination of our multi-stage training pipeline, including 2T tokens of pre-training and 1T tokens of mid-training across 13 regional languages.

What’s Under the Hood?

Better Tokenization with Gemma 3

All our new models utilise the Gemma 3 tokenizer, which has comparatively low fertility (the average number of tokens produced per word) for SEA languages. Lower fertility means each token carries more text, so the models can process more information per token for non-Latin scripts like Khmer, Lao, and Burmese.
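
As a rough illustration of why fertility matters, the sketch below computes it for two toy tokenizers and estimates how many words fit into a fixed context window. The tokenizers are deliberately simplistic stand-ins, not the Gemma 3 tokenizer, and the word list is assumed to be pre-segmented. Segmentation is itself non-trivial for scripts written without spaces.

```python
def fertility(words, tokenize):
    """Average number of tokens produced per word."""
    total_tokens = sum(len(tokenize(word)) for word in words)
    return total_tokens / len(words)


def words_per_context(context_len, fert):
    """Approximate number of words that fit into a context window."""
    return int(context_len / fert)


def char_tokenizer(word):
    """Toy tokenizer with high fertility: one token per character."""
    return list(word)


def bigram_tokenizer(word):
    """Toy tokenizer with lower fertility: one token per two characters."""
    return [word[i:i + 2] for i in range(0, len(word), 2)]


words = ["selamat", "pagi", "dunia"]
for name, tok in [("char", char_tokenizer), ("bigram", bigram_tokenizer)]:
    f = fertility(words, tok)
    print(name, round(f, 2), words_per_context(8192, f))
```

With the same 8,192-token window, the lower-fertility tokenizer fits noticeably more words, which is the effect described above.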

Get Started Today

The models and checkpoints are now available on our Hugging Face hub.

If you are building with SEA-LION, we want to hear from you! Join our community and let’s shape the future of Southeast Asian AI together. Contact us at sealion@aisingapore.org.