Release of WangchanLION-v3

WangchanLION-v3: An Open-Science Approach to Large-Scale Thai Pre-training Using SEA-LION

Introduction

We are very happy to announce the release of WangchanLION-v3, a collaboration between AI Singapore, VISTEC, and SCB10X. WangchanLION-v3 is an 8-billion-parameter model, pre-trained on 47 billion high-quality Thai tokens. The 47B-token corpus will also be released on AI Singapore's Hugging Face page.

While recent efforts in Thai language model development have made notable progress, they vary widely in approach, from applying general-purpose, language-agnostic pipelines (as in the Typhoon-2 project) to building custom pipelines specifically for Thai (as in OpenThaiGPT). However, most of these projects focus on open-sourcing models rather than making their data and collection methods transparent. This lack of accessible pretraining corpora and detailed pipeline documentation poses challenges for reproducibility and further research. This brings us to a key question: what does it take to build a high-quality pretraining corpus for Thai, and how can existing pipelines be adapted to better reflect its linguistic and cultural nuances?

To address this problem, we release WangchanLION-v3 as an open-source model together with its training data. The model was trained on Thai-English pre-training data, filtered from a raw pool of 100 billion tokens down to 47 billion high-quality tokens. Unlike other Thai pre-training efforts, we openly share comprehensive details, from data-cleaning pipeline experiments to SFT results, to shed light on the critical challenge of curating high-quality training data.

New Thai Pre-training Data

Here, we discuss the newly collected pre-training data, which is divided into two sets.

  1. CC-derived Data
  2. Curated Non-CC Data

CC-derived Data

We follow common practice in pre-training work by building this subset from the Common Crawl (CC) and FineWeb2 datasets.

Common Crawl. We collected Common Crawl datasets from CC-2018-30 to CC-2023-23, filtering for Thai content only. Text was extracted using Trafilatura, with Thai-language data identified via Common Crawl metadata. Following extraction, we applied our customized data cleaning pipeline to further clean and filter the dataset.
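
To make the extraction step concrete, below is a minimal sketch of it, assuming the Common Crawl records have already been fetched; the `records` helper and the metadata field handling are illustrative assumptions, not the exact production code.

```python
# Minimal sketch: extract main text from Common Crawl HTML with Trafilatura,
# keeping only records whose Common Crawl metadata marks them as Thai.
# `records` is assumed to yield (metadata_dict, raw_html) pairs from WARC files.
import trafilatura

def extract_thai_documents(records):
    """Yield cleaned text for records identified as Thai in CC metadata."""
    for metadata, html in records:
        # Common Crawl's index exposes detected content languages (ISO 639-3).
        if "tha" not in metadata.get("content_languages", ""):
            continue
        text = trafilatura.extract(html)  # returns None when extraction fails
        if text:
            yield {"url": metadata.get("url"), "text": text}
```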

FineWeb2. FineWeb2 is a web dataset derived from Common Crawl, covering the period from summer 2013 to April 2024. Its Thai subset contains approximately 51.4 billion words across 35 million documents. We further processed the FineWeb2 data using our customized data-cleaning pipeline, including deduplication of URLs and texts that overlap with our Common Crawl source. The resulting cleaned Thai FineWeb2 dataset comprises around 7.3 billion words and 4.6 million documents.
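
As a rough illustration of working with FineWeb2 before re-cleaning, the Thai subset can be streamed from the Hugging Face Hub; the dataset and config names below (`HuggingFaceFW/fineweb-2`, `tha_Thai`) and the presence of a `url` field are assumptions that should be checked against the Hub.

```python
# Sketch: stream the Thai FineWeb2 subset and drop documents whose URL already
# appears in our cleaned Common Crawl subset (held here as an in-memory set).
from datasets import load_dataset

fineweb_th = load_dataset(
    "HuggingFaceFW/fineweb-2",  # assumed dataset id
    name="tha_Thai",            # assumed Thai config name
    split="train",
    streaming=True,
)

seen_cc_urls = set()  # in practice, populated from the Common Crawl subset

deduplicated = (row for row in fineweb_th if row.get("url") not in seen_cc_urls)
```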

Curated Non-CC Data

In addition to internet-sourced data, we curated content from non-web and domain-specific sources, including Encyclopedic, Finance, Legal, Government, Education, and YouTube domains. Importantly, all collected data is released under a CC license (we obtained permission from all sources/owners to release the data under this license).

New Data Cleaning Pipeline

We also propose a new data-cleaning pipeline to filter out low-quality data. We adapt Dolma's data-collection approach, applying four major components:

  1. Language Identification: Instead of relying on FastText as a language identifier, as in Dolma, we use a rule-based approach for Thai script, which is more efficient in terms of both performance and speed (see the first sketch after this list).
  2. Quality Filters: For this step, we follow the original Dolma approach by applying the C4 and Gopher filtering rules. However, we adapted these rules to better suit the Thai language, making adjustments based on our investigations to improve compatibility and effectiveness (a Thai-adapted example is sketched after this list).
  3. Deduplication by URL: We use a Bloom filter to remove documents with duplicate URLs (see the Bloom-filter sketch below).
  4. Content Filters: We also enhanced the content filter to more effectively detect and remove not-safe-for-work (NSFW) content, personal contact information, and gambling-related material in Thai, improving upon the capabilities of existing filters.
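
The rule-based language identification in step 1 can be as simple as counting characters in the Thai Unicode block; below is a minimal sketch, where the 0.5 threshold and the restriction to alphabetic characters are illustrative assumptions rather than the pipeline's exact rule.

```python
# Sketch: rule-based Thai identification via the Thai Unicode block (U+0E00-U+0E7F).
THAI_START, THAI_END = 0x0E00, 0x0E7F

def thai_ratio(text: str) -> float:
    """Fraction of alphabetic characters that fall inside the Thai block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    thai = sum(1 for ch in letters if THAI_START <= ord(ch) <= THAI_END)
    return thai / len(letters)

def is_thai(text: str, threshold: float = 0.5) -> bool:
    # Keep a document as Thai if at least half of its letters are Thai script.
    return thai_ratio(text) >= threshold
```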
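
For step 2, the key Thai adaptation of the C4/Gopher rules is that Thai does not delimit words with spaces, so word-level statistics need a Thai segmenter. The sketch below assumes PyThaiNLP's word_tokenize and uses illustrative thresholds; it is not the exact rule set used in the pipeline.

```python
# Sketch: Gopher/C4-style length and symbol rules adapted for Thai, where a
# word segmenter replaces the whitespace splitting used for English.
from pythainlp import word_tokenize  # Thai word segmentation (assumed dependency)

def passes_quality_filters(text: str,
                           min_words: int = 50,
                           max_words: int = 100_000,
                           max_symbol_ratio: float = 0.1) -> bool:
    words = word_tokenize(text)
    # Gopher-style document-length rule, counted in Thai words, not whitespace tokens.
    if not (min_words <= len(words) <= max_words):
        return False
    # Gopher-style symbol-to-word rule: reject documents dominated by '#' or '...'.
    symbols = text.count("#") + text.count("...")
    return symbols / max(len(words), 1) <= max_symbol_ratio
```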
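
For step 3, a Bloom filter keeps URL deduplication memory-bounded at web scale, at the cost of a small false-positive rate (a tiny fraction of unique URLs may be discarded). The self-contained sketch below is illustrative; the bit-array size and hash count are placeholder choices, not the pipeline's actual settings.

```python
# Sketch: URL deduplication with a simple Bloom filter (illustrative parameters).
import hashlib

class BloomFilter:
    def __init__(self, num_bits: int = 1 << 27, num_hashes: int = 7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item: str):
        # Derive several hash positions from salted SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

def dedup_by_url(documents):
    """Yield only documents whose URL has not been seen before."""
    seen = BloomFilter()
    for doc in documents:
        if doc["url"] in seen:
            continue  # probable duplicate URL; drop the document
        seen.add(doc["url"])
        yield doc
```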

Key Findings

Data Quality Assessment

Before and After: Our pipeline significantly reduced dataset size while improving quality. For Common Crawl, we filtered the dataset from 202 million documents down to 25.1 million, resulting in cleaner text compared to the raw data. Similarly, we reduced the size of FineWeb2 by half. Most of the removed content was caught by the Gopher/C4 quality rules or flagged as gambling-related material, categories typically associated with low-quality data that does not contribute to downstream task performance.

Full Results

  • Common Crawl: While the CC subset cleaned by our pipeline achieves better overall normalised scores, its NLU score remains at zero, as it falls below the SEA-HELM baseline. Meanwhile, the SEACrowd evaluation shows that data from our pipeline yields superior performance on NLG tasks compared to the original dataset. This implies that our pipeline can filter out low-quality data, resulting in better performance on downstream tasks.
  • FineWeb2: Our cleaned subset achieves better normalised scores than the original FineWeb2 in the SEA-HELM evaluation, as shown in the table above, even though our pipeline reduces its size from 35.9 million documents to 17.1 million. Additionally, we found that prior to applying our data pipeline, FineWeb2 contained a significant amount of unwanted content, including low-quality text, duplicate URLs and overlapping text, adult content, gambling material, and personally identifiable information (PII). This also suggests that FineWeb2's own cleaning pipeline may not be well suited to Thai text.

How Do We Build a Pre-trained Model From Our Data?

After finalizing the optimal quality-control settings, we perform continued pre-training (CPT) on our training corpora, using Llama-SEA-LION-v3-8B-IT as the base model. The training configuration is as follows:

  • max_seq_len: 8192
  • learning rate: 5.0e-6 
  • optimizer: decoupled_lionw
  • lr_scheduler_type: cosine_with_warmup
  • num_train_epochs: 1
  • GPU: H100 (64 GPUs)
  • Time: 1d 12h 24m
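
The optimizer and scheduler names above suggest a Composer/LLM Foundry-style setup; purely as a recap, the sketch below collects the listed hyperparameters into one config dict, where the nesting and the exact model identifier are our assumptions.

```python
# Sketch: the reported CPT hyperparameters gathered into a single config
# (field nesting is assumed; values are as listed above).
import yaml

cpt_config = {
    "base_model": "Llama-SEA-LION-v3-8B-IT",  # check the exact Hub id
    "max_seq_len": 8192,
    "optimizer": {"name": "decoupled_lionw", "lr": 5.0e-6},
    "scheduler": {"name": "cosine_with_warmup"},
    "num_train_epochs": 1,
    "hardware": "64x NVIDIA H100",
}

print(yaml.safe_dump(cpt_config, sort_keys=False))
```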

Following training, we perform supervised fine-tuning (SFT) using several Thai instruction datasets, including Wangchan-FLAN-6M, Wangchan-60k, Seed5k, and WangchanInstruction. We apply QLoRA to compare the base model with our CPT-enhanced model. Additionally, we use the same SFT setup on other base LLMs such as Typhoon, SEA-LION, and Llama.
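
To make the SFT comparison concrete, here is a minimal QLoRA setup sketch using Hugging Face transformers, bitsandbytes, and PEFT; the LoRA rank, target modules, and quantization settings are illustrative placeholders rather than our exact configuration.

```python
# Sketch: wrap a base model (ours or one of the comparison LLMs) with 4-bit
# quantization plus LoRA adapters for the QLoRA-style SFT comparison.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

base_id = "aisingapore/WangchanLION-v3"  # swap in Typhoon, SEA-LION, or Llama

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id, quantization_config=bnb_config, device_map="auto"
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# The wrapped model is then trained on the Thai instruction datasets
# (Wangchan-FLAN-6M, Wangchan-60k, Seed5k, WangchanInstruction) with a standard SFT trainer.
```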

Downstream Task Results

SEA-HELM

Discussion: Our model achieves the best overall performance compared to the other three models, consistently outperforming our base model, Llama-SEA-LION-v3-8B-IT, across all benchmarks. Although it falls short of Typhoon v2 in one instance, this is on the safety evaluation task, which involves classifying toxic prompts, a task known for its inconsistency and difficulty on which no model achieves a high score.

MT-Bench: We found that when our CPT model is used as the base, performance on the Knowledge III (cultural evaluation) category is higher than that of the other models. This emphasizes the importance of our data, which also yields improvements in Thai cultural knowledge. In addition, we achieve improvements in the Roleplay and Reasoning categories.

Thai LLM Leaderboard

Discussion: As shown in the table, our model improves performance on the NLG tasks across all cases. We achieve an average score of 54.84 points, outperforming our base CPT model, SEA-LION, by 4.48 points. However, we observe that NLU performance is lower than that of SEA-LION and Typhoon2. This is because our optimization focuses on generation tasks, and adding more Thai data primarily enhances free-form generation rather than NLU tasks. As a result, there is a trade-off between language fluency and general knowledge. Notably, since the NLU datasets are not specific to Thai culture, unlike the Thai MT-bench, the model trained exclusively on Thai data may underperform on world-knowledge benchmarks such as XCOPA, XNLI, and Belebele.

Conclusion

To help close the gap in Thai pretraining resources and promote greater inclusion of Thai in the open-source NLP community, we introduce WangchanLION-v3, a large-scale, open-source Thai pretraining dataset comprising 47.4 billion tokens. We developed a Thai-specific data cleaning pipeline, validated through an ablation study to ensure each step’s effectiveness. By customizing the Dolma pipeline, we systematically processed Thai Common Crawl data with tailored filters for language identification, quality control, and harmful content removal. To further enhance data diversity, we incorporated additional sources such as Wikipedia, YouTube subtitles, and OCR-extracted texts from open-access books.

Resources

Pre-training data (web): https://huggingface.co/datasets/aisingapore/WangchanLION-Web 

Pre-training data (curated): https://huggingface.co/datasets/aisingapore/WangchanLION-Curated 

Pre-training model: https://huggingface.co/aisingapore/WangchanLION-v3 

SFT model: https://huggingface.co/aisingapore/WangchanLION-v3-IT

Github: https://github.com/vistec-AI/Mangosteen