Assessing LLM Performance for Southeast Asia: Introducing SEA-HELM
In collaboration with the HELM team at Stanford CRFM (with special thanks to Yifan Mai), we would like to announce the official release of SEA-HELM (Southeast Asian Holistic Evaluation of Language Models), a general-purpose holistic benchmark to evaluate the performance of LLMs in the Southeast Asian context.
Distinctive Features
- SEA-HELM (leaderboard here; paper here; code here), previously known as BHASA, is a multilingual and multicultural evaluation benchmark that holistically assesses the performance of LLMs in the Southeast Asia (SEA) region.
- We are closing the gap in LLM evaluation for SEA by working in close collaboration with native speakers to create new benchmarks and to authentically translate and localise English benchmarks so that they comprehensively cover the SEA region.
- We constructed SEA Linguistics, also known as LINDSEA (LINguistic Diagnostics for Southeast Asian languages), as part of SEA-HELM: a benchmark that specifically targets and verifies models’ understanding of the morphological, syntactic, semantic and pragmatic aspects of SEA languages.
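Linguistic diagnostics of this kind are commonly scored with minimal pairs: the model is shown a grammatical sentence and a minimally different ungrammatical one, and it passes if it prefers the grammatical variant. The sketch below illustrates the idea only; the function names and the toy scorer are ours, not the SEA-HELM implementation (a real run would query a language model for log-probabilities):

```python
def passes_minimal_pair(log_prob, grammatical: str, ungrammatical: str) -> bool:
    """A model 'passes' a minimal pair when it assigns a higher
    log-probability to the grammatical sentence than to its
    minimally different ungrammatical counterpart."""
    return log_prob(grammatical) > log_prob(ungrammatical)

# Toy stand-in scorer for illustration: penalises out-of-vocabulary words.
def toy_log_prob(sentence: str) -> float:
    vocab = {"dia", "sedang", "membaca", "buku"}  # tiny "known word" list
    return sum(0.0 if w in vocab else -5.0 for w in sentence.lower().split())

print(passes_minimal_pair(toy_log_prob,
                          "Dia sedang membaca buku",
                          "Dia buku membaca sedangkan"))  # True
```

The design choice that matters here is that the check is purely comparative, so it needs no generation or reference answers, only sentence-level scores.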
Cutting-edge Components
- Languages: Filipino, Indonesian, Javanese, Sundanese, Tamil, Thai, Vietnamese.
- Javanese and Sundanese datasets were built in collaboration with the GoTo team.
- More languages will be added as we actively expand our SEA language coverage.
- Key SEA capabilities captured in the SEA-HELM suite:

- NLP classics: This component evaluates Natural Language Understanding (e.g. Extractive QA and Sentiment Analysis), Natural Language Generation (translation and summarisation), and Natural Language Reasoning (Causal Reasoning and Natural Language Inference), using datasets developed natively in each language to avoid translationese.
- LLM-specifics: We develop automated evaluation metrics for instruction following and human-like conversation, built on our SEA-IFEval and SEA-MTBench datasets. Both are 100% validated by native speakers, ensuring localised, culturally nuanced assessment and fair comparison of model capabilities.
- SEA Linguistics: LINDSEA is a pioneering, high-quality linguistic dataset for Southeast Asian languages that evaluates models’ language proficiency and grammatical understanding through a detailed taxonomy of syntactic, semantic, and pragmatic phenomena.
- SEA Culture: Cultural representation and bias are crucial in LLMs to avoid social harm. Our Filipino cultural dataset, Kalahi, uses a participatory approach to ensure LLMs provide culturally-relevant responses for Filipino contexts.
- Safety: Tailoring safety benchmarks for SEA languages is crucial to prevent LLMs from generating unsafe outputs in lower-resource languages, with efforts focused on creating inclusive, representative datasets for tasks like toxicity detection.
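Instruction-following suites in the IFEval style rely on verifiable constraints: each prompt carries format requirements that can be checked programmatically rather than judged by another model. A minimal sketch of two such checks (the function names are illustrative, not the actual SEA-HELM code):

```python
import re

def check_bullet_count(response: str, expected: int) -> bool:
    """Count markdown bullet lines and compare with the requested number."""
    bullets = [ln for ln in response.splitlines()
               if re.match(r"^\s*[-*]\s+", ln)]
    return len(bullets) == expected

def check_word_limit(response: str, max_words: int) -> bool:
    """Verify the response stays within a word budget."""
    return len(response.split()) <= max_words

response = "- satu\n- dua\n- tiga"
print(check_bullet_count(response, 3))  # True
print(check_word_limit(response, 10))   # True
```

Because every constraint is deterministic, scores are reproducible and need no reference outputs, which is what makes this style of evaluation practical across many languages.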
Counts for Selected Tasks in SEA-HELM
| Competency | Task | Language | Count |
|---|---|---|---|
| NLP Classic – NLG | Translation (English to Native Language) | Indonesian | 1017 |
| | | Vietnamese | 1017 |
| | | Thai | 1017 |
| | | Tamil | 1017 |
| | | Filipino | 605 |
| | | Total | 4673 |
| | Translation (Native Language to English) | Indonesian | 1017 |
| | | Vietnamese | 1017 |
| | | Thai | 1017 |
| | | Tamil | 1017 |
| | | Filipino | 605 |
| | | Total | 4673 |
| | Translation (Indonesian to Native Language) | Javanese | 399 |
| | | Sundanese | 399 |
| | | Total | 798 |
| | Translation (Native Language to Indonesian) | Javanese | 399 |
| | | Sundanese | 399 |
| | | Total | 798 |
| | Abstractive Summarisation | Indonesian | 105 |
| | | Vietnamese | 105 |
| | | Thai | 105 |
| | | Tamil | 105 |
| | | Filipino | 105 |
| | | Total | 525 |
| | Total | | 11467 |
| NLP Classic – NLR | Natural Language Inference | Indonesian | 1005 |
| | | Vietnamese | 1005 |
| | | Thai | 1005 |
| | | Tamil | 1005 |
| | | Filipino | 605 |
| | | Total | 4625 |
| | Causal Reasoning | Indonesian | 505 |
| | | Vietnamese | 505 |
| | | Thai | 505 |
| | | Tamil | 505 |
| | | Filipino | 405 |
| | | Total | 2425 |
| | Total | | 7050 |
| NLP Classic – NLU | Metaphor Understanding | Indonesian | 300 |
| | | Javanese | 287 |
| | | Sundanese | 300 |
| | | Total | 887 |
| | Multiple-Choice Question Answering (qa-mc) | Javanese | 285 |
| | | Sundanese | 285 |
| | | Total | 570 |
| | Question Answering (qa; also multiple choice) | Indonesian | 105 |
| | | Vietnamese | 105 |
| | | Thai | 105 |
| | | Tamil | 105 |
| | | Total | 420 |
| | Paraphrase | Filipino | 405 |
| | | Total | 405 |
| | Belebele Multiple-Choice Question Answering | Indonesian | 900 |
| | | Tamil | 900 |
| | | Filipino | 105 |
| | | Thai | 900 |
| | | Vietnamese | 900 |
| | | Total | 3705 |
| | Sentiment Analysis | Indonesian | 405 |
| | | Vietnamese | 1005 |
| | | Thai | 1005 |
| | | Tamil | 1005 |
| | | Filipino | 605 |
| | | Javanese | 399 |
| | | Sundanese | 399 |
| | | Total | 4823 |
| | Total | | 10810 |
| LLM-Specific: Instruction Following | Instruction/Format Following (IF-Eval) | Indonesian | 105 |
| | | Vietnamese | 105 |
| | | Thai | 105 |
| | | Filipino | 105 |
| | | Javanese | 105 |
| | | Sundanese | 105 |
| | | Total | 630 |
| | Total | | 630 |
| LLM-Specific: Chat | Multi-Turn Chat (MT-Bench) | Indonesian | 58 |
| | | Vietnamese | 58 |
| | | Thai | 91 |
| | | Filipino | 58 |
| | | Javanese | 58 |
| | | Sundanese | 58 |
| | | Total | 381 |
| | Total | | 381 |
| SEA Safety | Toxicity | Indonesian | 1005 |
| | | Vietnamese | 1005 |
| | | Thai | 1005 |
| | | Filipino | 405 |
| | | Total | 3420 |
| | Total | | 3420 |
| SEA Linguistics | Pragmatic True/False Task (pragmatic-single) | Indonesian | 105 |
| | | Tamil | 85 |
| | | Total | 190 |
| | Pragmatic Context Understanding (pragmatic-pair) | Indonesian | 89 |
| | | Tamil | 89 |
| | | Total | 178 |
| | Syntax (mp-r) | Indonesian | 385 |
| | | Tamil | 475 |
| | | Total | 860 |
| | Total | | 1228 |
| SEA Culture | Kalahi | Filipino | 150 |
| | | Total | 150 |
| | Total | | 150 |
| SEA-HELM Total | | | 35136 |
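As a quick sanity check, the competency-level totals in the table can be summed to reproduce the SEA-HELM grand total:

```python
# Competency-level totals taken directly from the table above.
competency_totals = {
    "NLP Classic – NLG": 11467,
    "NLP Classic – NLR": 7050,
    "NLP Classic – NLU": 10810,
    "LLM-Specific: Instruction Following": 630,
    "LLM-Specific: Chat": 381,
    "SEA Safety": 3420,
    "SEA Linguistics": 1228,
    "SEA Culture": 150,
}
print(sum(competency_totals.values()))  # 35136
```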
Collaboration
- SEA-HELM is being built in collaboration with the HELM team and especially Yifan Mai at Stanford CRFM (https://crfm.stanford.edu/helm/seahelm/latest/).
- Other collaborators are:
- Project SEALD:
- Harsh Dhand
- Pratyusha Mukherjee
- Dinesh Tewari
- Trevor Cohn
- Partha Talukdar
- Andrea Seow
- SEA-LION Regional Network:
- AI Governance – AI Singapore:
- Markus Labude
- Hakim Norhashim
- Eric Orlowski
- Tristan Koh
- IMDA:
- Vanessa Wilfred
- Chen Hui Ong
- Wan Sie Lee
Invitation to Contribute
- AI Singapore, launched in May 2017, brings together all Singapore-based research institutions and the vibrant ecosystem of AI start-ups and companies developing AI products to perform use-inspired research, grow knowledge, create tools, and develop the talent to power Singapore’s AI efforts.
- AI Singapore is committed to open-sourcing all artifacts for the benefit of Southeast Asia. We encourage researchers, developers, and language enthusiasts to actively contribute to the enhancement and expansion of SEA-HELM.
- If there are research topics where you think AISG and you could pursue a common research agenda that furthers the work of SEA-HELM, please feel free to reach out to us as well. We are also open to joint publications arising from these collaborations.
- If you are interested in any form of collaboration, please feel free to reach out to sealion@aisingapore.org.
