Assessing LLM Performance for Southeast Asia: Introducing SEA-HELM

In collaboration with the HELM team at Stanford CRFM (with special thanks to Yifan Mai), we would like to announce the official release of SEA-HELM (Southeast Asian Holistic Evaluation of Language Models), a general-purpose holistic benchmark to evaluate the performance of LLMs in the Southeast Asian context.

Distinctive Features

  • SEA-HELM (leaderboard here; paper here; code here), previously known as BHASA, is a multilingual and multicultural evaluation benchmark that holistically assesses the performance of LLMs in the Southeast Asia (SEA) region.
  • We are closing the gap in the evaluation of LLMs for SEA by working closely with native speakers to create new benchmarks and to authentically translate and localise English benchmarks so that they comprehensively cover the SEA region.
  • As part of SEA-HELM, we constructed SEA Linguistics, also known as LINDSEA (LINguistic Diagnostics for Southeast Asian languages), a benchmark that specifically targets and tests models’ understanding of the morphological, syntactic, semantic, and pragmatic aspects of SEA languages (see the minimal-pair sketch after this list).
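A common way to probe grammatical knowledge with minimal pairs, as LINDSEA’s syntax subset does, is to check whether a model assigns higher probability to a grammatical sentence than to a minimally different ungrammatical one. The sketch below illustrates that log-likelihood comparison with a Hugging Face causal LM; the model name and the Indonesian sentence pair are placeholders of our own rather than LINDSEA items, and SEA-HELM’s actual prompting and scoring (see the paper and code above) may differ.

```python
# Illustrative sketch only: scores one LINDSEA-style minimal pair by checking
# whether a causal LM assigns a higher total log-likelihood to the grammatical
# sentence than to its ungrammatical counterpart.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; substitute the model you want to evaluate
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def sentence_log_likelihood(sentence: str) -> float:
    """Total log-probability of the sentence under the model (higher = more plausible)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])
    # outputs.loss is the mean negative log-likelihood over the predicted tokens,
    # so multiplying by the number of predicted positions recovers the total.
    num_predicted = inputs["input_ids"].shape[1] - 1
    return -outputs.loss.item() * num_predicted

# Hypothetical Indonesian minimal pair (basic word order); these are our own
# examples, not items drawn from LINDSEA.
grammatical = "Anak itu sedang membaca buku di perpustakaan."
ungrammatical = "Anak itu buku membaca sedang di perpustakaan."

prefers_grammatical = sentence_log_likelihood(grammatical) > sentence_log_likelihood(ungrammatical)
print("Model prefers the grammatical sentence:", prefers_grammatical)
```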

Cutting-edge Components

  • Languages: Filipino, Indonesian, Javanese, Sundanese, Tamil, Thai, Vietnamese.
    • Javanese and Sundanese datasets were built in collaboration with the GoTo team.
    • More languages will be added as we actively expand our coverage of SEA languages.
  • Key SEA capabilities captured in the SEA-HELM suite: we taxonomise the LLM evaluation space into five key pillars and include various competencies under each pillar, which together make up our holistic and integrated approach (a scoring-aggregation sketch follows this list).
    • NLP classics: This pillar evaluates Natural Language Understanding with tasks such as Extractive QA and Sentiment Analysis, Natural Language Generation with translation and summarisation, and Natural Language Reasoning with Causal Reasoning and Natural Language Inference, using datasets developed in their respective native languages to avoid translationese.
    • LLM-specifics: We develop automated evaluation metrics focused on instruction-following and human-like conversation, leveraging our 100% native-speaker-validated SEA-IFEval and SEA-MTBench datasets for localised, culturally nuanced assessments that accurately represent and fairly compare model capabilities.
    • SEA Linguistics: LINDSEA is a pioneering, high-quality linguistic dataset for Southeast Asian languages that evaluates models’ language proficiency and grammatical understanding through a detailed taxonomy of syntactic, semantic, and pragmatic phenomena.
    • SEA Culture: Cultural representation and bias are crucial considerations for LLMs in order to avoid social harm. Our Filipino cultural dataset, Kalahi, uses a participatory approach to ensure LLMs provide culturally relevant responses in Filipino contexts.
    • Safety: Tailoring safety benchmarks to SEA languages is crucial to prevent LLMs from generating unsafe outputs in lower-resource languages, with efforts focused on creating inclusive, representative datasets for tasks such as toxicity detection.
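To make the five-pillar taxonomy concrete, here is a minimal sketch of how per-task scores could be rolled up into competency, pillar, and overall averages. The scores and the unweighted two-level mean are illustrative assumptions only; SEA-HELM’s actual normalisation and aggregation are defined in the paper and code linked above and may differ.

```python
# Illustrative sketch only: one plausible way to roll task scores up into
# competency, pillar, and overall averages under the five-pillar taxonomy.
from statistics import mean

# Hypothetical per-task scores on a 0-100 scale for one model and one language.
scores = {
    "NLP classics": {
        "NLU": {"sentiment": 71.0, "extractive-qa": 64.5},
        "NLG": {"translation": 58.2, "summarisation": 47.9},
        "NLR": {"nli": 66.1, "causal-reasoning": 73.4},
    },
    "LLM-specifics": {
        "instruction-following": {"sea-ifeval": 55.0},
        "multi-turn-chat": {"sea-mtbench": 61.3},
    },
    "SEA Linguistics": {"lindsea": {"syntax": 69.8, "pragmatics": 52.4}},
    "SEA Culture": {"kalahi": {"kalahi": 48.7}},
    "Safety": {"toxicity": {"toxicity-detection": 62.0}},
}

def pillar_score(competencies: dict) -> float:
    # Average tasks within each competency first, then average competencies,
    # so that competencies with many tasks do not dominate the pillar.
    return mean(mean(tasks.values()) for tasks in competencies.values())

pillar_scores = {pillar: pillar_score(comps) for pillar, comps in scores.items()}
overall = mean(pillar_scores.values())

for pillar, value in pillar_scores.items():
    print(f"{pillar:20s} {value:5.1f}")
print(f"{'Overall':20s} {overall:5.1f}")
```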

Counts for Selected Tasks in SEA-HELM

| Competency | Task | Language | Count |
|---|---|---|---|
| NLP Classic – NLG | Translation (English to Native Language) | Indonesian | 1017 |
| | | Vietnamese | 1017 |
| | | Thai | 1017 |
| | | Tamil | 1017 |
| | | Filipino | 605 |
| | | Task total | 4673 |
| | Translation (Native Language to English) | Indonesian | 1017 |
| | | Vietnamese | 1017 |
| | | Thai | 1017 |
| | | Tamil | 1017 |
| | | Filipino | 605 |
| | | Task total | 4673 |
| | Translation from Indonesian to Native Language | Javanese | 399 |
| | | Sundanese | 399 |
| | | Task total | 798 |
| | Translation from Native Language to Indonesian | Javanese | 399 |
| | | Sundanese | 399 |
| | | Task total | 798 |
| | Abstractive Summarisation | Indonesian | 105 |
| | | Vietnamese | 105 |
| | | Thai | 105 |
| | | Tamil | 105 |
| | | Filipino | 105 |
| | | Task total | 525 |
| | Competency total | | 11467 |
| NLP Classic – NLR | Natural Language Inference | Indonesian | 1005 |
| | | Vietnamese | 1005 |
| | | Thai | 1005 |
| | | Tamil | 1005 |
| | | Filipino | 605 |
| | | Task total | 4625 |
| | Causal Reasoning | Indonesian | 505 |
| | | Vietnamese | 505 |
| | | Thai | 505 |
| | | Tamil | 505 |
| | | Filipino | 405 |
| | | Task total | 2425 |
| | Competency total | | 7050 |
| NLP Classic – NLU | Metaphor Understanding | Indonesian | 300 |
| | | Javanese | 287 |
| | | Sundanese | 300 |
| | | Task total | 887 |
| | Multiple-Choice Question Answering (qa-mc) | Javanese | 285 |
| | | Sundanese | 285 |
| | | Task total | 570 |
| | Question Answering (qa)* | Indonesian | 105 |
| | | Vietnamese | 105 |
| | | Thai | 105 |
| | | Tamil | 105 |
| | | Task total | 420 |
| | Paraphrase | Filipino | 405 |
| | | Task total | 405 |
| | Belebele Multiple-Choice Question Answering | Indonesian | 900 |
| | | Tamil | 900 |
| | | Filipino | 105 |
| | | Thai | 900 |
| | | Vietnamese | 900 |
| | | Task total | 3705 |
| | Sentiment Analysis | Indonesian | 405 |
| | | Vietnamese | 1005 |
| | | Thai | 1005 |
| | | Tamil | 1005 |
| | | Filipino | 605 |
| | | Javanese | 399 |
| | | Sundanese | 399 |
| | | Task total | 4823 |
| | Competency total | | 10810 |
| LLM-Specific: Instruction Following | Instruction/Format Following (IF-Eval) | Indonesian | 105 |
| | | Vietnamese | 105 |
| | | Thai | 105 |
| | | Filipino | 105 |
| | | Javanese | 105 |
| | | Sundanese | 105 |
| | | Task total | 630 |
| | Competency total | | 630 |
| LLM-Specific: Chat | Multi-Turn Chat (MT-Bench) | Indonesian | 58 |
| | | Vietnamese | 58 |
| | | Thai | 91 |
| | | Filipino | 58 |
| | | Javanese | 58 |
| | | Sundanese | 58 |
| | | Task total | 381 |
| | Competency total | | 381 |
| SEA Safety | Toxicity | Indonesian | 1005 |
| | | Vietnamese | 1005 |
| | | Thai | 1005 |
| | | Filipino | 405 |
| | | Task total | 3420 |
| | Competency total | | 3420 |
| SEA Linguistics | Pragmatic True/False task (pragmatic-single) | Indonesian | 105 |
| | | Tamil | 85 |
| | | Task total | 190 |
| | Pragmatic Context Understanding (pragmatic-pair) | Indonesian | 89 |
| | | Tamil | 89 |
| | | Task total | 178 |
| | Syntax (mp-r) | Indonesian | 385 |
| | | Tamil | 475 |
| | | Task total | 860 |
| | Competency total | | 1228 |
| SEA Culture | Kalahi | Filipino | 150 |
| | | Task total | 150 |
| | Competency total | | 150 |
| SEA-HELM total | | | 35136 |

*Question Answering (qa) items are also multiple choice.

Collaboration

Invitation to contribute

  • AI Singapore, launched in May 2017, brings together all Singapore-based research institutions and the vibrant ecosystem of AI start-ups and companies developing AI products to perform use-inspired research, grow knowledge, create tools, and develop the talent to power Singapore’s AI efforts.
  • AI Singapore is committed to open-sourcing all artifacts for the benefit of Southeast Asia. We encourage researchers, developers, and language enthusiasts to actively contribute to the enhancement and expansion of SEA-HELM.
  • If there are research topics where you think AISG and you could work together on a common research agenda that furthers SEA-HELM, please feel free to reach out to us as well. We are also open to joint publications arising from such collaborations.
  • If you are interested in any form of collaboration, please feel free to reach out to sealion@aisingapore.org.