SEA-LION v3: 128K Context Length and 70B Models

We are excited to announce the release of two new variants of SEA-LION v3, our latest large language models tailored specifically for Southeast Asian languages. Building on Meta's Llama 3.1 and SEA-LION's data, these variants have strong capabilities in handling the diverse linguistic and cultural nuances inherent to the Southeast Asian region.

New SEA-LION v3 Variants

1. SEA-LION v3 8B

  • Architecture: Based on Llama 3.1 8B
  • Parameters: 8 billion
  • Context Length / Performance: Features a large context length of 128K tokens, enabling the model to handle extensive and complex dialogues effectively (a quick sanity check follows this list).
  • Use Case: Ideal for applications requiring deep contextual understanding and long-form content processing.
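
As a quick sanity check, the advertised context window can be read from the model configuration with Hugging Face transformers; the repo id below is an assumption based on the model names listed under Accessibility and Availability:

    from transformers import AutoConfig

    # Loads only the configuration; no model weights are downloaded.
    # Repo id is an assumption based on the names listed later in this post.
    config = AutoConfig.from_pretrained("aisingapore/Llama-SEA-LION-v3-8B-IT")

    # Llama 3.1-based models expose their context window here.
    print(config.max_position_embeddings)  # expected: 131072 (~128K tokens)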

2. SEA-LION v3 70B

  • Architecture: Based on Llama 3.1 70B
  • Parameters: 70 billion
  • Context Length / Performance: Our largest model to date (as of December 2024), also with a 128K-token context length, delivering stronger benchmark results than its predecessors and comparably sized contemporaries.
  • Use Case: Suited for high-demand environments where advanced reasoning and comprehensive language comprehension are essential.

Technical Enhancements in SEA-LION v3

Continued Pre-Training

Both variants underwent continued pre-training on Llama 3.1 using an additional 200 billion tokens of Southeast Asian data. This extensive training enhances the models’ understanding of regional languages and cultural contexts, resulting in significant performance boosts in languages such as Thai, Vietnamese, Tamil, and Indonesian.

Post-Training

Both variants underwent supervised fine-tuning (SFT) in two stages (a minimal sketch follows the list):

  1. Stage 1: Focuses on math and reasoning, using approximately 9.5 million instruction examples, predominantly in English.
  2. Stage 2: Emphasizes chat and instruction-following tasks with around 7.3 million instruction examples, including a substantial portion in Southeast Asian languages.
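
As a rough illustration of this staged recipe, here is a minimal sketch using the TRL library's SFTTrainer. The dataset names, epoch counts, and base-model repo id are placeholders, not the actual training configuration:

    from datasets import load_dataset
    from trl import SFTConfig, SFTTrainer

    BASE = "aisingapore/Llama-SEA-LION-v3-8B"  # assumed repo id

    # Stage 1: math and reasoning instructions, predominantly English.
    stage1 = SFTTrainer(
        model=BASE,
        train_dataset=load_dataset("your-org/reasoning-sft", split="train"),  # placeholder
        args=SFTConfig(output_dir="sft-stage1", num_train_epochs=1),
    )
    stage1.train()
    stage1.save_model("sft-stage1")

    # Stage 2: chat and instruction following, continuing from Stage 1,
    # with a substantial share of Southeast Asian languages.
    stage2 = SFTTrainer(
        model="sft-stage1",
        train_dataset=load_dataset("your-org/chat-sft", split="train"),  # placeholder
        args=SFTConfig(output_dir="sft-stage2", num_train_epochs=1),
    )
    stage2.train()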

This fine-tuning process, combined with model merging techniques, allows SEA-LION v3 to gain regional capabilities while mitigating issues like catastrophic forgetting of the base model's general performance.
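
The exact merging recipe is not specified here; as an illustration, the sketch below applies linear weight averaging, one common merging technique, between a fine-tuned checkpoint and its base model (both checkpoint paths are hypothetical):

    import torch
    from transformers import AutoModelForCausalLM

    # Hypothetical checkpoints; alpha controls the interpolation.
    base = AutoModelForCausalLM.from_pretrained(
        "aisingapore/Llama-SEA-LION-v3-8B", torch_dtype=torch.bfloat16)
    tuned = AutoModelForCausalLM.from_pretrained(
        "sft-stage2", torch_dtype=torch.bfloat16)

    alpha = 0.5  # weight on the fine-tuned model
    tuned_state = tuned.state_dict()
    merged_state = {
        name: (1 - alpha) * param + alpha * tuned_state[name]
        for name, param in base.state_dict().items()
    }

    base.load_state_dict(merged_state)
    base.save_pretrained("merged-model")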

Multilingual Proficiency

SEA-LION v3 supports 13 languages, including the newly added Javanese and Sundanese. This multilingual capability ensures that our models can cater to a wide array of Southeast Asian languages, fostering greater accessibility and usability across the region. In addition, our experiments show modest cross-lingual transfer, which helps languages that are less well represented in digital data.

Training Infrastructure

  • Hardware: Utilized MosaicML Composer on AWS p5e.48xlarge and SingTel HGX-100 instances equipped with NVIDIA H200 and H100 GPUs.
  • Training Duration: The 8B variant was trained for approximately 136 hours, while the 70B variant was trained for approximately 495 hours.
  • Configuration: Both models employ bfloat16 precision, the decoupled_adamw optimizer, and a global batch size of 512 (a minimal sketch follows this list).
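
A minimal sketch of what this Composer setup might look like for the continued pre-training run; only the precision and optimizer choices come from the configuration above, while the base checkpoint, learning rate, duration, and data are illustrative placeholders:

    import torch
    from torch.utils.data import DataLoader
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from composer import Trainer
    from composer.models import HuggingFaceModel
    from composer.optim import DecoupledAdamW

    # Wrap a Hugging Face causal LM for Composer (base checkpoint assumed).
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
    tokenizer.pad_token = tokenizer.eos_token
    hf_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
    model = HuggingFaceModel(hf_model, tokenizer=tokenizer)

    # Tiny dummy corpus so the sketch is self-contained; the real run streams
    # ~200B tokens of Southeast Asian text at a global batch size of 512.
    enc = tokenizer(["Selamat pagi, apa khabar?"] * 4, return_tensors="pt", padding=True)
    enc["labels"] = enc["input_ids"].clone()
    samples = [{k: v[i] for k, v in enc.items()} for i in range(4)]
    train_dataloader = DataLoader(samples, batch_size=2)

    trainer = Trainer(
        model=model,
        train_dataloader=train_dataloader,
        optimizers=DecoupledAdamW(model.parameters(), lr=1e-5),  # lr illustrative
        precision="amp_bf16",  # bfloat16 mixed precision
        max_duration="1ep",    # illustrative duration
    )
    trainer.train()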

Evaluation Metrics

SEA-LION v3 variants have been rigorously evaluated using both English and Southeast Asian benchmarks:

  • English Evaluation: Uses tasks from the Open LLM Leaderboard v2, including MMLU-Pro, MuSR, and others (a sketch of running these follows this list).
  • Southeast Asian Evaluation: Employs SEA-HELM metrics covering sentiment analysis, toxicity detection, causal reasoning, and more, tailored to regional languages.
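
For reference, English-side evaluations of this kind can be run with the lm-evaluation-harness Python API; the task names and repo id below are assumptions that depend on the harness version, and the SEA-HELM evaluations use their own separate harness:

    import lm_eval

    # Open LLM Leaderboard v2 style tasks; names are assumptions, so list
    # the available tasks in your harness version before running.
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=aisingapore/Llama-SEA-LION-v3-8B-IT,dtype=bfloat16",
        tasks=["leaderboard_mmlu_pro", "leaderboard_musr"],
        batch_size=8,
    )
    print(results["results"])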

On these (admittedly narrow) benchmarks, SEA-LION v3 outperforms many open-source models, including larger ones such as Llama 3.3 70B Instruct, on several tasks, establishing new standards for regional AI capabilities. See our Leaderboard for details.

Accessibility and Availability

All SEA-LION v3 variants are open-source and freely available for research and commercial use. Developers and enterprises can immediately access the models on platforms such as Hugging Face, Kaggle, and Ollama.
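
For example, the instruction-tuned 8B model can be loaded with transformers as shown below; the repo id assumes the models are hosted under the aisingapore organization on Hugging Face:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "aisingapore/Llama-SEA-LION-v3-8B-IT"  # assumed repo id
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto")

    messages = [{"role": "user", "content": "Apa ibu kota Indonesia?"}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

    outputs = model.generate(inputs, max_new_tokens=64)
    print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))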

Hugging Face

Gemma-SEA-LION-v3-9B
Gemma-SEA-LION-v3-9B-IT
Gemma-SEA-LION-v3-9B-IT-GGUF

Llama-SEA-LION-v3-8B
Llama-SEA-LION-v3-8B-IT
Llama-SEA-LION-v3-8B-IT-GGUF

Llama-SEA-LION-v3-70B
Llama-SEA-LION-v3-70B-IT
Llama-SEA-LION-v3-70B-IT-GGUF

Ollama

Gemma-SEA-LION-v3-9B-IT
Llama-SEA-LION-v3-8B-IT
Llama-SEA-LION-v3-70B-IT
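
These instruction-tuned variants can also be queried locally through the Ollama Python client; the model tag below is an assumption, so check the Ollama library listing for the exact name:

    import ollama  # pip install ollama; requires a running Ollama server

    # Pull the model first, e.g. `ollama pull <tag>`; the tag is an assumption.
    response = ollama.chat(
        model="aisingapore/Llama-SEA-LION-v3-8B-IT",
        messages=[{"role": "user", "content": "Terjemahkan ke bahasa Inggris: Selamat datang!"}],
    )
    print(response["message"]["content"])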

Kaggle

Gemma-SEA-LION-v3-9B
Gemma-SEA-LION-v3-9B-IT

Llama-SEA-LION-v3-8B
Llama-SEA-LION-v3-8B-IT

Partner Models

We are also happy to share a few models built by our partners and collaborators in the region:

Indonesia

sahabat-ai

Thailand

WangchanLION v2

Acknowledgments

We extend our gratitude to our partners and collaborators across Southeast Asia.

We are also grateful for the support of the Infocomm Media Development Authority (IMDA) of Singapore.