Towards fair and comprehensive multilingual LLM benchmarking

Explore how to design fair, transparent, and representative multilingual evaluations for large language models.

Large language models (LLMs) can now generate complete and coherent sentences in many more languages than most humans speak. Emergent abilities, such as reasoning, creative writing, and human-agent interaction, have propelled enterprises to adapt these models to a wide range of use cases far beyond classic NLP tasks. Where we once trained a new model for each use case, today a single LLM can handle this broad gamut of tasks. Notably, LLMs are demonstrating an ability to overcome linguistic barriers — even for languages that are not targeted during training. Evaluating these LLMs thus requires a fundamental change in methodology, departing from the multilingual benchmarks that were developed for classic NLP tasks.

Automatic translation is not enough

Many LLM evaluation benchmarks tailored to the new abilities of LLMs are still predominantly in English and reflect Western-centric viewpoints (see Singh et al. 2024 for a discussion). Multilingual benchmarks are commonly translations of these English benchmarks, thereby lacking the cultural nuances found in other languages, and they contain translation errors and biases. This can result in cultural erasure, furthering stereotypical or non-diverse views (see Qadri et al. 2025). Thus, there is a need to develop authentic, human-verified multilingual evaluation datasets and metrics. 

The need to rethink evaluation approaches

Ultimately, what we intend to achieve are fair, transparent, and comprehensive multilingual and multicultural evaluations. In this blog post, we identify and explore three core challenges to developing such evaluations, and we discuss the evaluation approaches taken by two multilingual and multicultural initiatives: SEA-HELM by AI Singapore and the Aya initiative led by Cohere For AI.

Our multilingual and multicultural evaluation initiatives

Throughout this blog post, we will draw from our experiences in two multilingual and multicultural evaluation initiatives.

SEA-HELM (SouthEast Asian Holistic Evaluation of Language Models) is a multilingual and multicultural evaluation benchmark that assesses the performance of LLMs in the Southeast Asia (SEA) region. It covers a diverse range of natural language tasks across multiple languages, including Indonesian, Tamil, Thai, and Vietnamese, with Filipino currently in the pipeline. AI Singapore’s dedicated SEA-LION team keeps the suite up to date in close collaboration with native language experts. 

Aya is a global initiative led by Cohere For AI to advance the state of the art in multilingual AI and bridge gaps between people and cultures across the world. Aya is an open science project involving over 3,000 independent researchers across 119 countries. In 2024, the Aya initiative released massively multilingual instruction fine-tuning data, a generative evaluation suite (which we will delve into below), and several multilingual LLMs (MLLMs): from Aya-101, covering 101 languages, to Aya-23 and the latest Aya Expanse, each covering 23 languages.

We now turn to the core challenges that we encountered in both efforts.

Challenge 1: Multilingual, but how multilingual?

When comparing multilingual LLMs, we first need to establish the base of comparison, i.e., select the languages we are comparing them on. Imagine you are building a multilingual model. Would you rather (A) prioritize your model’s language coverage and evaluate your model and competitors on all the languages your model supports, or (B) prioritize fairness in comparisons, and restrict comparisons to languages officially supported by each competitor?

This decision is complicated because documentation of language support is lacking, i.e., many open and commercial models do not explicitly state which languages they support. The question of language support thus has to be approached with guesswork, as the pretraining mix is rarely known and official evaluations are not comprehensive enough. For example, the GPT-4 technical report features just one multilingual benchmark, yet GPT-4 is widely applied across various languages and tasks.

Language support is nuanced and could be established in multiple ways: through inclusion of a language in pre/post-training, success in automatic or human evaluations, or simply user-reported satisfaction. Furthermore, even if a language is “unsupported,” the model might still perform reasonably well (see the study by Holtermann et al. 2024) or have great potential to be adapted to that language. Conversely, when a multilingual model is tuned to a specific set of languages, it cannot be assumed that it will still sufficiently support the languages on which it was pre-trained.

Solution: The Bender Rule for multilingual models  

To improve fairness in multilingual model comparisons, we first need more transparency in language support. As an extension of the Bender Rule (“always name the language(s) that you are working on”), model releases should explicitly state the languages they were trained and evaluated on in their model cards. A language support statement is a responsible choice, as it sets clear expectations for users and developers regarding output quality.

It can be as simple as in the model card for Aya-23 shown on the left side of Figure 1. It may also be more nuanced and address expected use, as seen in Llama 3.2’s model card, which lists languages with official support, but also gives information on pre-training mix and further use (see the right side of Figure 1).

Figure 1: Supported languages as listed in the model cards for Aya-23 (left) and Llama 3.2 (right).

Evidence for language support should be grounded and demonstrated in evaluation results (see, for instance, Aya-23’s technical report with evaluations for all 23 languages), or by listing training data sources and their languages. Eventually, for even greater precision, this should go beyond languages, documenting domains, scripts, and regional varieties.
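
As an illustration, a language support statement in a model card's metadata could look like the following hypothetical snippet. Only the `language` key is a standard Hugging Face model-card field; the `language_details` block is our own invention to sketch what finer-grained documentation of scripts and varieties might look like:

```yaml
# Hypothetical model-card metadata (YAML front matter).
# `language` is a standard field; `language_details` is illustrative only.
language:
  - id   # Indonesian
  - ta   # Tamil
  - th   # Thai
  - vi   # Vietnamese
language_details:
  ta:
    scripts: [Taml]
    varieties: [literary, spoken]
    coverage: pre-training and instruction tuning
    evaluated_on: [SEA-HELM]
  th:
    scripts: [Thai]
    coverage: pre-training only
    evaluated_on: []
```

A statement of this shape makes explicit not just which languages were seen, but at which training stage and with what evaluation evidence.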

While users can run evaluations on unsupported languages, having a clearer indication of the supported versus unsupported languages will allow users to better interpret results and isolate the task of cross-lingual transfer to unsupported languages. 

Challenge 2: Lost in aggregation

With the growing abilities of LLMs, we are committed to continuously expanding languages and tasks covered in evaluations. SEA-HELM currently evaluates LLMs on a total of 13 tasks across four SEA languages. Given the wide range of tasks and languages, comparing models requires some form of aggregation of the individual task results to establish a model ranking, for example, by taking the average of all the tasks (as in the Open LLM Leaderboard 2) or by calculating the win rate against a pool of models (as in HELM Classic). However, aggregation of metrics poses a few challenges. 

  1. The outliers can distort the aggregated score. These outliers may result from tasks or languages being underrepresented in the training distribution of a model. For example, Aya Expanse has a lower overall rank on SEA-HELM due to its lack of language support for Thai and Tamil, but it performs competitively on Indonesian and Vietnamese.
  2. Metrics have differing scales and dynamic ranges. Additionally, not all tasks have the same baselines. This makes it difficult to understand the meaning behind the differences between scores across tasks and metrics. Furthermore, it means that certain scores implicitly contribute more weight to the final ranking than others.

    Kocmi et al. 2024 discuss this problem from the perspective of interpreting translation metrics. For example, a one-point difference in the BLEU score (an overlap metric) corresponds to a score difference of 0.24 in Comet22 (a deep learning-based metric), which has a smaller dynamic range (see Figure 2 for a visual comparison).

Figure 2: Screenshot taken from MT Metrics Thresholds showing how a one-point difference in BLEU compares against other metrics. Note the wide spread of differences, rendering direct comparisons among different scales uninterpretable.

  3. There could be a misalignment between the user’s requirements and the overall aggregated score. A user looking to use LLMs for translation might be better served by a model that performs well on the particular translation tasks of interest rather than by the best overall model according to the aggregate score.
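
To make the scale problem concrete, here is a toy illustration with hypothetical scores (not real benchmark numbers), keeping in mind the rough one-BLEU-point-to-0.24-Comet correspondence reported by Kocmi et al. 2024:

```python
# Hypothetical scores for two models on a translation task,
# measured with two metrics of very different scales.
bleu  = {"model_a": 32.0, "model_b": 33.0}   # overlap metric, range ~0-100
comet = {"model_a": 0.80, "model_b": 0.56}   # learned metric, range ~0-1

# Naively averaging raw scores lets the 1-point BLEU gain swamp the
# 0.24 Comet drop, even though the Comet drop signals a much larger
# quality difference.
naive = {m: (bleu[m] + comet[m]) / 2 for m in bleu}
# model_b now ranks above model_a despite its far worse Comet score.
```

This is exactly why scores must be brought onto a common, comparable scale before any averaging.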

Solution: Transparent aggregation of task metrics 

To keep metrics as interpretable as possible, any decision regarding their processing and aggregation should be made transparent.

In SEA-HELM, we ensured that all metrics are normalized so that they have the same range. This is documented alongside the aggregated metrics (see Figure 3). We chose to aggregate our metrics by grouping them into six competencies — natural language understanding, natural language generation, natural language reasoning, linguistic diagnostics, instruction-following, and multi-turn chat.
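
A minimal sketch of this kind of normalization and grouped aggregation (not SEA-HELM's exact implementation; the baseline values and competency grouping below are illustrative):

```python
from statistics import mean

def normalize(score: float, baseline: float, maximum: float = 1.0) -> float:
    """Rescale a task score so that a random/trivial baseline maps to 0
    and a perfect score maps to 1; clip below at 0."""
    return max(0.0, (score - baseline) / (maximum - baseline))

def aggregate(competencies: dict) -> float:
    """competencies: name -> list of normalized task scores.
    Averaging within each competency first prevents competencies
    with many tasks from dominating the final score."""
    return mean(mean(scores) for scores in competencies.values())

# Example: accuracy 0.625 on a 4-option multiple-choice task
# (random baseline 0.25) normalizes to 0.5.
nlu = [normalize(0.625, 0.25), normalize(0.70, 0.25)]
nlg = [normalize(0.45, 0.0)]
overall = aggregate({"NLU": nlu, "NLG": nlg})
```

Averaging per competency before averaging across competencies is one defensible design choice; weighting schemes differ between leaderboards, which is precisely why they should be documented.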

Figure 3: The SEA-HELM leaderboard contains a description of how the scores of the various tasks are normalized.

We present the individual task metrics alongside the aggregated metrics, so that users are able to make their own decisions based on all the available information. The supported languages for each model are also indicated (see Figure 4) to help users interpret the results for models which do not support all the evaluated languages.

Figure 4: The SEA-HELM leaderboard lists supported languages with each model. 

Challenge 3: Scaling authentic representation 

Currently, many benchmarks are technically multilingual, but no single multilingual benchmark is at once broad in its language coverage, representative, and established (i.e., widely adopted).

Multilingual benchmarks are frequently built on texts sourced from the internet, such as FLORES+ or XLSum. The problem with this approach is that such data may not effectively represent societies that have lower internet penetration or mainly access the internet through privately-owned applications that do not allow open data access (see Ahia et al. 2021). Representation on the web is dependent on suitable keyboards and software. For example, the Burmese font Zawgyi was only developed in 2006, and it was migrated to Unicode in 2019. Finally, registers, domains, and varieties of lower-resource languages in particular are rarely uniformly represented. Tamil datasets, for example, heavily feature literary Tamil, leading to lower support for spoken varieties.

Another popular source for measuring multilingual understanding is knowledge exams, turned into multiple-choice tasks. Their evolution showcases the challenges of achieving sufficient coverage and representation at scale: many developers first used automatic translations of the English MMLU benchmark to scale to their languages of choice, but these lacked human validation to ensure linguistic accuracy and faithfulness of translations. MMMLU then improved the translation of MMLU with professional translators, but it was limited to 14 languages. Global MMLU further improved language coverage (42) and quality. However, for the SEA region, even Global MMLU does not cover Burmese, Tamil, or Thai, all of which are official languages in various SEA countries. 

Beyond representing multiple languages, multilingual benchmarks do not necessarily represent multiple cultures, or might not even be culturally authentic. While many benchmarks claim to include a cultural component or are presented as being cultural evaluations, most of the data is either synthetically generated or based on what is readily available on the internet. Importantly, the authors of Global MMLU clearly demonstrate that even with faithful translations, MMLU as a multilingual evaluation fails to capture cultural nuances present in the different target languages and, in fact, contains inherent biases. This was found through human annotations of culturally agnostic and culturally sensitive questions, an approach also taken by the Aya redteaming effort that revealed the importance of cultural contextualization of LLM harm.

Solution 1: A participatory approach


There has been much interest, especially recently, in adopting a participatory approach to data creation, a trend we find very encouraging (see, for example, Kirk et al. 2024, Romanou et al. 2024, Romero et al. 2024, Singh et al. 2024, or Urailertprasert et al. 2024). We believe that a participatory approach is necessary to achieve authentic multicultural representation. In SEA-HELM, the following components lead to linguistically and culturally authentic representation:

  1. Collaborations with native speakers: The Kalahi project is a grassroots Filipino cultural evaluation suite developed by the SEA-LION team. Focus group discussions were held with community members to first identify topics relevant to their everyday lives. They then wrote prompts and alternative responses. Finally, subsets of other native speakers with varying mixes of income levels, genders and age groups were invited to evaluate the candidate responses for relevance. This project is currently being integrated with the SEA-HELM suite and also expanded into other cultures.
  2. Linguistic diagnostics: A suite of Indonesian and Tamil datasets, handcrafted from scratch, was collaboratively created with native speakers from the respective communities. The inclusion of these linguistic diagnostics ensures that users can determine the models’ understanding of morphological, syntactic, semantic and pragmatic aspects of these languages, all of which are crucial ingredients of natural and fluent language generation (previous works have also developed linguistic diagnostics for English, Mandarin, or Japanese).
  3. Localization: SEA-IFEval is an instruction-following benchmark created collaboratively with native speakers. It is manually translated from the English IFEval benchmark and, crucially, localized. Manual translations ensure faithful and accurate linguistic representation, while localization ensures cultural authenticity and removes any unintended or inherent biases (see Table 1).
  4. Human translations: SEA-MTBench, based on the English MT-Bench, is a multi-turn chat benchmark that is also manually translated and contains localized multi-turn scenario diagnostics. Especially for longer (multi-turn) instructions, manual translations provide realistic human inputs that help us verify whether multilingual chatbot applications can hold conversations in a faithful, accurate and coherent manner.

Table 1: An example from the English IFEval dataset and the localized SEA-IFEval version. In the scripts of several Asian languages, such as Tamil, Thai, or Mandarin, the original instruction cannot be fulfilled and needs to be localized. In this case, we opted to change the requirement from featuring the letter “l” to featuring the number “4” instead.
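
IFEval-style benchmarks verify such constraints programmatically, which is what makes localization decisions like this one consequential. A hypothetical checker for the frequency constraint might look like this (function name and signature are our own invention, not the actual SEA-IFEval code):

```python
def satisfies_frequency_constraint(response: str, symbol: str, min_count: int) -> bool:
    """Check an IFEval-style constraint: the response must contain
    `symbol` at least `min_count` times. For scripts that lack the
    letter 'l', a localized constraint can count the digit '4' instead."""
    return response.count(symbol) >= min_count

# Original English-style constraint vs. localized constraint:
satisfies_frequency_constraint("hello world", "l", 3)   # True (three l's)
satisfies_frequency_constraint("4 + 44 = 48", "4", 3)   # True (four 4's)
```

Because the checker counts literal symbols, a constraint on a Latin letter is unverifiable for responses written in Tamil or Thai script, so the constraint itself, not just the prompt text, must be localized.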

SEA-HELM places an emphasis on the SEA region, which gives the SEA-LION team a unique opportunity to focus on ensuring linguistic and cultural authenticity across SEA languages. We hope that anyone building models that cover these (or a subset of these) languages would consider our benchmark for measuring performance on them. More generally speaking, this level of attention to detail and amount of expert annotation is commonly found in benchmarks that focus only on one or a few languages, and not in those that scale via machine-translation/generation.  

Solution 2: LLM judges for crowdsourced prompts

When evaluating Aya models, we realized that classical NLP tasks and discriminative tests are insufficient predictors of end-user experience. We found open-ended generation tasks to be more representative and more challenging, but how do we create and evaluate these tasks for all languages of interest at scale? Here are our solutions: 

Table 2: Sample prompts from the Aya evaluation suite: human-annotated prompts, selected prompts from Dolly-15k, and a subset of ArenaHard prompts.

  1. Crowdsourced prompts: In a first participatory approach, we collected prompts representative of target languages and cultures from the Aya community and released them as part of the Aya evaluation suite. This did not scale sufficiently, as not all languages received the same number of contributions. As a secondary approach, we addressed multilingual scaling via translations of English prompts, first of Databricks’ employee-submitted English Dolly prompts. However, these were not challenging enough to detect MLLM progress with recent models, so we settled on the hardest user-submitted English prompts from the Chatbot Arena (Li et al. 2024; translations released as m-ArenaHard) for Aya Expanse evaluations. See Table 2 above for English examples from each source.
  2. LLM judges: Arenas for LLM evaluation let users choose their preferred generation, but for offline evaluations and benchmarking, we can only sparingly rely on human evaluations. Despite known limitations (biases, language coverage, reliability, and costs), using LLM judges as a proxy for human evaluations is the most scalable and practical solution (see Figure 5 below). When deploying LLM judges, it is important to investigate their preferences, both in comparison to humans and in comparison to other LLM judges. For Aya, we found that GPT-4 aligned sufficiently with our human ratings in the six languages that we tested, but we also learned that humans and GPT-4 might resolve corner cases differently. For instance, we observed that GPT-4 was more lenient with mistakes in lengthy responses (details in the Aya model paper). When evaluating Aya Expanse, we compared GPT-4o and GPT-4o-mini as judges and found that GPT-4o-mini showed up to a 10% stronger preference for Aya Expanse. Thus, different judges might tell different stories, especially in the diverse landscape of multilingual models.

Figure 5: Win-rate evaluation results for multiple languages from comparing Aya Expanse 8B with Gemma-2 9B on m-ArenaHard prompts.
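
Win rates like those in Figure 5 are computed from per-prompt pairwise judge verdicts; a minimal sketch follows (the half-credit tie convention is one common choice, not necessarily the one used for the Aya evaluations):

```python
from collections import Counter

def win_rate(verdicts: list) -> float:
    """Win rate of model A over model B from per-prompt judge verdicts.
    Each verdict is 'A', 'B', or 'tie'; ties count as half a win."""
    counts = Counter(verdicts)
    return (counts["A"] + 0.5 * counts["tie"]) / len(verdicts)

# Example: A wins 2 of 4 prompts outright and ties once.
win_rate(["A", "B", "A", "tie"])  # → 0.625
```

Since judge verdicts can flip with judge identity and even response order, robust comparisons should report which judge produced the verdicts and, ideally, results from more than one.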

With the growing multilinguality of LLMs and a more widely distributed user base, arenas will receive more non-English and culturally diverse submissions, and more diverse user experiences can be studied, so that our evaluations will eventually become more representative. At the same time, more work is needed to build and identify strong MLLM judges, as with the multilingual reward bench, and to find ways to overcome LLM biases, such as with juries of judges.

Practical recommendations

We hope that the three challenges discussed above have given you an overview of the intricate work that goes into designing multilingual evaluation datasets for LLMs. To summarize, we leave you with a set of practical recommendations.

For ML practitioners invested in developing and evaluating multilingual language models, we promote testing using culturally representative datasets and culturally relevant tasks. Achieving this will require extensive collaborations with communities of native speakers to capture authentic data. When these models are then released, it is essential that they are transparent about the languages and task domains they have been trained on to enable downstream users to accurately estimate their capabilities. 

For owners of evaluation leaderboards, we suggest aggregating results across languages, tasks, and models in an easily interpretable manner. These leaderboards should also provide more detailed breakdowns, with descriptions of how each metric relates to model performance. There is also space for innovation here, as NLP evaluations move towards more human-like responses. More contextual and subjective factors, such as tone and style, may also be useful signals to capture, especially by using LLMs as judges. 

Finally, for the users of LLMs, we advocate choosing models that perform better on the specific tasks that are relevant to their downstream use cases, not merely those that top leaderboards based on an aggregate score. We also call for greater interactions between downstream users and leaderboard owners, enabling the former to request specific signals to be added to leaderboards to better inform industry decisions.

The world of NLP no longer knows bounds of languages, domains, and even other modalities such as speech, images, and video. Through these evaluation guidelines, our joint aspiration at AI Singapore and Cohere for AI is to enable this revolution’s gifts to be equally available to communities all over the world, and to drive innovation that truly brings out the potential of humans everywhere.

About the authors (in alphabetical order)

Adithya Venkatadri Hulagadri is an AI Engineer on AI Singapore’s SEA-LION team with a keen interest in Cognitive Science and Natural Language Processing.

Julia Kreutzer is a Research Scientist at Cohere’s research lab, Cohere for AI, focusing on advancing multilingual modeling and evaluation.

Jian Gang Ngui is a Linguist on AI Singapore’s SEA-LION team who advocates for strong community collaboration and specializes in multilingual data curation and model evaluations.

Xian Bin Yong is an AI Engineer on AI Singapore’s SEA-LION team working towards better multilingual and multicultural evaluation of LLMs.

About the contributors

We would like to express our gratitude to our colleagues at AI Singapore who have greatly contributed to enriching the discussions and generously provided their valuable and insightful feedback for this blog post. In particular, we would like to acknowledge Weiqi Leong, Jann Railey Montalan, Hamsawardhini Rengarajan, and Yosephine Susanto, each of whom made significant contributions to the development of this post. We especially thank Leslie Teo, Darius Liu, and William Tjhi for their immense support that helped this project come to fruition.

Similarly, we would like to thank our colleagues at Cohere for AI, Sara Hooker, Marzieh Fadaee, and Ahmet Üstün for helping improve the blog post, Madeline Smith for guiding us to publication, and Sally Vedros for the copy edit.

Collaborate with us

We are continuously improving SEA-HELM and expanding the languages and tasks covered. SEA-HELM is also being actively integrated into the original HELM (Holistic Evaluation of Language Models) framework. We welcome collaboration, and invite our readers to contact us at seald@aisingapore.org.

Join the Aya community, a space for multilingual AI researchers worldwide to connect, learn from one another, and work collaboratively to advance the field of ML research. We will continue to host open science initiatives.