Bridging the Semantic Gap: Announcing the SEA-LION Embedding Suite
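Today, we are announcing the SEA-LION embedding suite, a family of embedding models built for Southeast Asian languages. On our SEA-BED benchmark, the suite’s top model scores higher than every other model we compared.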
Proven on SEA-BED
To ensure these models actually perform where it matters, we evaluated them on SEA-BED (Southeast Asia Embedding Benchmark). Unlike global benchmarks that rely on machine translation, SEA-BED uses human-curated data across 10 regional languages, including Tetum, a highly underserved language in our region. At the same time, we kept English and Chinese performance as close to SOTA as possible. Our models consistently show top-tier performance across:
- Retrieval & Reranking
- Semantic Textual Similarity (STS)
- Bitext Mining
- Instructional Retrieval
Performance Comparison


| Model | Organisation | Size | SEA-BED (SEA) | MTEB (EN) | CMTEB (ZH) | Document Throughput (doc/s) ↑ | Time per Document (ms) ↓ |
|---|---|---|---|---|---|---|---|
| SEA-LION-E5-Embedding-600M | AISG | 0.6B | 80.03 | 61.41 | 60.79 | 79.77 | 12.54 |
| E5-large | Microsoft | 0.6B | 78.93 | 61.21 | 60.51 | 81.45 | 12.28 |
| SEA-LION-ModernBERT-Embedding-600M | AISG | 0.6B | 78.45 | 60.64 | 60.47 | 91.16 | 10.97 |
| Qwen-8B-Embedding | Alibaba | 8B | 77.26 | 68.71 | 75.00 | 3.20 | 312.19 |
| BGE-M3 | BAAI | 0.6B | 76.46 | – | – | 86.25 | 11.59 |
| SEA-LION-ModernBERT-Embedding-300M | AISG | 0.3B | 76.00 | 58.21 | 58.20 | 127.06 | 7.87 |
| sentence-transformers/LaBSE | Google | 0.5B | 74.99 | 49.45 | – | 108.40 | 9.22 |
| Embedding Gemma | Google | 0.3B | 70.44 | 65.11 | – | 32.86 | 30.43 |
| sentence-transformers/paraphrase-multilingual-mpnet-base-v2 | Microsoft | 0.3B | 65.40 | 55.71 | – | 59.10 | 16.92 |
| Qwen-0.6B-Embedding | Alibaba | 0.6B | 60.66 | 64.72 | 67.45 | 13.03 | 76.75 |
| sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 | Microsoft | 0.1B | 54.39 | 53.99 | – | 57.36 | 17.43 |
| text-embedding-3-small | OpenAI | – | 52.89 | – | – | could not process | could not process |
Notes:
- Throughput: Calculations are based on an inference environment utilising 8x NVIDIA H200 GPUs.
- Time per Document: Measured in milliseconds using the official translations of the ASEAN Charter.
- Speed Test Configuration: Documents were processed in chunks, with each chunk size set to the model’s maximum context length minus 2.
- Averaging: Both Throughput and Time per Document results are averaged across 100 independent runs to reduce run-to-run variance.
- Benchmark Scoring: All SEA-BED performance scores (including reference models) are calculated in-house to ensure direct comparability. For MTEB and CMTEB, scores for SEA-LION models are calculated in-house, while scores for other reference models are taken from the official MTEB/CMTEB leaderboards without further validation. A dash (–) indicates that the reported results are either incomplete or not available on the respective leaderboards.
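The speed-test set-up in the notes above can be sketched roughly as follows. This is a minimal illustration, not the actual benchmarking harness: the function names are hypothetical, `embed_fn` stands in for any model's encode call, and reading the two subtracted tokens as room for special tokens is an assumption.

```python
import time
from statistics import mean


def chunk_tokens(token_ids, max_context_length, reserved=2):
    """Split token IDs into chunks of at most max_context_length - reserved tokens.

    Assumption: the 2 reserved positions leave room for special tokens.
    """
    chunk_size = max_context_length - reserved
    if chunk_size <= 0:
        raise ValueError("max_context_length must exceed the reserved token count")
    return [token_ids[i:i + chunk_size] for i in range(0, len(token_ids), chunk_size)]


def ms_per_document(docs_per_second):
    """Convert throughput (doc/s) into average time per document (ms)."""
    return 1000.0 / docs_per_second


def average_throughput(embed_fn, documents, runs=100):
    """Time embed_fn(documents) over several runs and return the average doc/s."""
    run_seconds = []
    for _ in range(runs):
        start = time.perf_counter()
        embed_fn(documents)  # hypothetical stand-in for a model's encode call
        run_seconds.append(time.perf_counter() - start)
    return len(documents) / mean(run_seconds)
```

For example, `chunk_tokens(list(range(20)), max_context_length=10)` yields chunks of 8, 8, and 4 tokens, and `round(ms_per_document(79.77), 2)` gives 12.54, which matches the Time per Document reported for SEA-LION-E5-Embedding-600M in the table.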
Try It Yourself: SEA-LION Embedding Demo
To help you get started, we have provided a practical demo showcasing how to integrate these embedding models into your RAG workflows. The demo includes source code for document indexing, semantic search, and performance benchmarking across various SEA languages.
You can access the full source code and documentation at the sealion-embedding-demo GitHub repository.
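To give a feel for what semantic search over embeddings involves, here is a minimal sketch. It is not taken from the demo repository: the function names are hypothetical, and the three-dimensional vectors are toy stand-ins for the embeddings a real model would produce.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_vector, index, top_k=2):
    """Return the top_k (doc_id, score) pairs ranked by cosine similarity."""
    scored = [(doc_id, cosine_similarity(query_vector, vec))
              for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for real document embeddings (illustration only).
index = {
    "doc_th": [0.9, 0.1, 0.0],
    "doc_vi": [0.1, 0.9, 0.0],
    "doc_id": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]

results = search(query, index)
print([doc_id for doc_id, _ in results])
```

With these toy vectors the query ranks `doc_th` first and `doc_vi` second. In a real pipeline the vectors would come from an embedding model, and the index would usually live in a vector database rather than a dictionary.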
Why Embeddings Matter for the Region
While Large Language Models (LLMs) grab the headlines, embeddings are the silent engine behind modern AI. They are critical for current LLM systems, specifically for memory, search, and retrieval-augmented generation (RAG). Whether you are building a chatbot for a government portal in Bangkok, a customer support bot for an e-commerce platform in Jakarta, or a legal document analyser in Hanoi, high-quality embeddings are the difference between “getting the gist” and “truly understanding.”
This release includes five specialised models and four foundational checkpoints across three distinct architectures, all unified by one goal: providing the best vector representations for Southeast Asia’s unique linguistic landscape.
We are releasing these models under the MIT License to support developers, researchers, and startups across the ASEAN region.
The Lineup: Precision for Every Use Case
For those who require the highest possible accuracy for embedding tasks such as retrieval-augmented generation (RAG), clustering, semantic textual similarity (STS), or reranking, we recommend our SEA-LION embedding suite. The models are designed to balance state-of-the-art accuracy, speed, and memory efficiency.
1. The Powerhouse: SEA-LION-E5-Embedding-600M
We tested a large pool of models and selected E5-Large for fine-tuning on our curated SEA data. The resulting model achieves the highest SEA-BED score among the models we compared. It can serve as a drop-in replacement in existing agentic workflows and is specifically optimised to understand the nuances of languages like Thai, Vietnamese, and Indonesian.
2. Efficient Embedding: SEA-LION-ModernBERT-Embedding (300M & 600M)
The SEA-LION-ModernBERT-Embedding models (300M and 600M) are our flagship encoders, built from scratch using the ModernBERT architecture and the Gemma 3 tokenizer. They are designed for maximum performance in a small footprint, providing dense, high-quality embeddings for long-document RAG in the region. The 300M model has the highest document throughput in the comparison above.
Further fine-tuned specifically for multilingual embedding tasks, the SEA-LION-ModernBERT-Embedding 300M and 600M models feature native 8,192-token context windows and alternating attention mechanisms to handle complex semantic tasks.
3. More Checkpoints for SEA Developers: SEA-LION-ModernBERT (300M & 600M)
We are releasing the raw pre-trained checkpoints (300M and 600M), built from scratch, for the community to build upon. These checkpoints represent the culmination of our multi-stage training pipeline, including 2T tokens of pre-training and 1T tokens of mid-training across 13 regional languages.
What’s Under the Hood?
Better Tokenization with Gemma 3
All our new models utilise the Gemma 3 tokenizer, which has comparatively lower fertility (fewer tokens per word on average) for SEA languages. Each token therefore carries more information, so the models can fit more text into the same context window for non-Latin scripts like Khmer, Lao, and Burmese.
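As a rough illustration of how fertility can be measured, the sketch below divides the total token count by the total word count. The two tokenizers are toy stand-ins, not the Gemma 3 tokenizer, and splitting on whitespace is a simplifying assumption: scripts such as Khmer, Lao, and Burmese do not separate words with spaces, so a real measurement would need a proper word segmenter.

```python
def fertility(texts, tokenize):
    """Average number of tokens produced per whitespace-separated word."""
    total_tokens = sum(len(tokenize(text)) for text in texts)
    total_words = sum(len(text.split()) for text in texts)
    if total_words == 0:
        raise ValueError("texts contain no words")
    return total_tokens / total_words


# Toy tokenizers for illustration only (hypothetical, not real model tokenizers).
def char_tokenizer(text):
    """Worst case: one token per non-space character."""
    return [ch for ch in text if not ch.isspace()]


def word_tokenizer(text):
    """Best case: one token per word."""
    return text.split()


sample = ["selamat pagi"]
print(fertility(sample, char_tokenizer))  # 11 characters over 2 words
print(fertility(sample, word_tokenizer))  # 2 tokens over 2 words
```

A lower value means fewer tokens are spent on the same text, which is why a lower-fertility tokenizer stretches a fixed context window further.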
Get Started Today
The models and checkpoints are now available on our Hugging Face hub.
- Explore the Models: Hugging Face / AI Singapore
- Review the Benchmark: SEA-BED paper & dataset
If you are building with SEA-LION, we want to hear from you! Join our community and let’s shape the future of Southeast Asian AI together. You can reach us at sealion@aisingapore.org.
