Bridging Cultures and Pixels: Introducing SEA-LION-VL, a New Vision-Text Model for Southeast Asia
We are thrilled to announce the launch of Gemma-SEA-LION-v4-27B-VL, a cutting-edge, instruction-tuned vision-text model designed specifically for the unique and diverse landscape of Southeast Asia. This release is the result of a powerful collaboration between the Products Pillar at AI Singapore and SEACrowd, with funding from the Singapore National Research Foundation (NRF).
A Multimodal Model That Understands Southeast Asia
Built on the powerful Gemma 3 architecture, Gemma-SEA-LION-v4-27B-VL was developed using a deliberate, two-stage training process to balance broad instruction-following capabilities with a specialized, region-aware foundation.
A Technical Deep Dive into Our First Vision-Language Model for Southeast Asia
Let’s take a look at our development process, initial findings, and the complex challenges of building region-specific Vision-Language Models (VLMs). Our goal is to share our learnings with the community as we navigate this new frontier. For our analysis, we compare the performance of SEA-LION-VL against its vision-capable counterparts that have not been trained specifically on SEA images: our Gemma-SEA-LION-v4 model and the original Gemma-3-IT.
Training Methodology: A Two-Stage Approach
The development of SEA-LION-VL involved a deliberate, two-stage training process built upon the Gemma 3 architecture.
- Continued Pre-Training (CPT) on SEA Images: We began with Gemma-SEA-LION-v4-IT and trained on a dataset of 10 million image-text caption pairs. Crucially, this dataset was filtered to prioritize contexts relevant to Southeast Asia, aiming to imbue the model with a foundational understanding of regional visual concepts. This stage produced an intermediary model, SEA-LION-VL-Intermediate.
- Supervised Fine-Tuning (SFT): SEA-LION-VL-Intermediate was then fine-tuned on approximately 540,000 samples from the general-domain Mammoth dataset. It is important to note that this SFT dataset was not specific to Southeast Asia.
This methodology was designed to balance broad visual instruction-following capabilities with a specialized, region-aware foundation, to create Gemma-SEA-LION-v4-27B-VL.
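The SEA-focused filtering in the first stage can be pictured as a relevance check over image-caption pairs. The sketch below is purely illustrative: the keyword list, matching rule, and function names are our own assumptions, not the actual filtering pipeline, which used its own (unpublished) criteria.

```python
# Illustrative sketch of SEA-relevance filtering for image-caption pairs.
# The keyword list and matching rule here are hypothetical stand-ins.

SEA_KEYWORDS = {
    "satay", "hanoi", "jakarta", "bangkok", "manila", "durian",
    "temple", "batik", "mekong", "borneo", "pho", "nasi",
}

def is_sea_relevant(caption: str) -> bool:
    """Return True if the caption mentions any SEA-associated keyword."""
    tokens = {t.strip(".,!?").lower() for t in caption.split()}
    return not tokens.isdisjoint(SEA_KEYWORDS)

def filter_pairs(pairs):
    """Keep only (image_url, caption) pairs whose caption looks SEA-relevant."""
    return [(url, cap) for url, cap in pairs if is_sea_relevant(cap)]

pairs = [
    ("img1.jpg", "Grilled satay skewers with peanut sauce."),
    ("img2.jpg", "A red sports car on a highway."),
    ("img3.jpg", "The Temple of Literature in Hanoi at dawn."),
]
print(filter_pairs(pairs))  # keeps img1 and img3
```

In practice, production-scale filtering would rely on classifiers, language identification, and metadata rather than a keyword list, but the shape of the operation is the same: score each pair for regional relevance and keep the high-scoring subset.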
Model in Action

1. Why is there a ribbon tied to the boat? What does it mean?
Responses
- Gemma-SEA-LION-v4-27B-VL
The ribbons tied to the boat are a traditional Thai offering to the spirits of the boat and the sea, seeking protection and good luck for the vessel and its occupants. They are often tied during a blessing ceremony.
- Gemma3-27B-IT
The ribbons tied to the boat are a traditional Thai practice called “Sai Sin”. They are tied to boats (and other vehicles/objects) for good luck, protection, and to ward off evil spirits. They are often offered as a blessing for safe travels.
- SEA-LION-v4-Gemma-27B-IT
The ribbons tied to the boat are a traditional Thai custom called “Sai Sin.” They are tied to boats (and other vehicles/objects) for good luck, protection, and to ward off evil spirits. They are often offered as a blessing for safe travels.
Note: the bolded text indicates that the information provided is not accurate

2. What is this place? What temple in Vietnam is it in?
Responses
- Gemma-SEA-LION-v4-27B-VL
This is the Temple of Literature (Văn Miếu) in Hanoi, Vietnam. Specifically, this image shows the altar dedicated to Confucius within the temple complex.
- Gemma3-27B-IT
This is the Temple of the Kings (Đền Vua) within the Hùng Temple Complex (Khu di tích Đền Hùng) in Phú Thọ Province, Vietnam. It’s a significant site honoring the Hùng Kings, considered the founders of the Vietnamese nation.
- SEA-LION-v4-Gemma-27B-IT
Based on the image, this is the Temple of Literature (Văn Miếu) in Hanoi, Vietnam. Specifically, it depicts the altar dedicated to Confucius within the temple complex. The statues are of the Four Pillars of Wisdom.
Note: the bolded text indicates that the information provided is not accurate

3. Can someone with nut allergies eat this?
Responses
- Gemma-SEA-LION-v4-27B-VL
Potentially no. Satay sauce (the dip) often contains peanuts. It’s unsafe for someone with nut allergies without confirming the ingredients.
- Gemma3-27B-IT
Potentially no. Satay sauce often contains peanuts. The dish in the image appears to be satay (grilled meat skewers) with sauce and a vegetable relish. Someone with nut allergies should avoid this unless the sauce is confirmed to be nut-free.
- SEA-LION-v4-Gemma-27B-IT
Potentially no. Satay (the meat on skewers) and its peanut-based sauce often contain nuts (peanuts specifically). The salad is likely safe, but the satay is a high risk. Always confirm ingredients with the vendor.
Note: the bolded text indicates that the information provided is not accurate
Evaluation Deep Dive: A Mixed Performance Landscape
Our evaluation reveals a nuanced picture of the model’s capabilities. While challenges remain in complex reasoning tasks, we observed promising results in its core descriptive abilities.
Modest Gains in Cross-Lingual Image Captioning (XM3600)
To assess the model’s descriptive capabilities, we evaluated it on the XM3600 benchmark. This cross-lingual dataset is critical as it measures the ability to generate relevant image captions in multiple languages, including several spoken in Southeast Asia.

XM3600 Image Captioning Results
Our analysis revealed modest but notable performance improvements in Mandarin, Thai, and Vietnamese compared to the baseline models. This suggests that initial training on SEA images was successful in enhancing the model’s ability to connect regional visual concepts with the appropriate and culturally nuanced descriptions, as exhibited by our model’s responses for the Culture, Cuisine and Historical Landmark examples above.
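Caption benchmarks like XM3600 score a model by comparing its generated captions against human-written references in each language. As a simplified illustration of this kind of overlap scoring (this is not XM3600's actual metric, and the captions are made up), here is a unigram-F1 scorer:

```python
# Simplified caption-overlap scorer: unigram F1 against a reference caption.
# Real captioning benchmarks use more robust metrics; this only illustrates
# the reference-comparison idea.

def unigram_f1(candidate: str, reference: str) -> float:
    cand = set(candidate.lower().split())
    ref = set(reference.lower().split())
    if not cand or not ref:
        return 0.0
    overlap = len(cand & ref)
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

score = unigram_f1(
    "a bowl of pho with herbs on a wooden table",
    "a bowl of pho topped with fresh herbs",
)
print(round(score, 2))  # → 0.71
```

A per-language average of such scores over the benchmark's images is what produces the kind of language-by-language comparison discussed above.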
Preserving Text Capabilities and VQA Challenges

SEA-HELM scores show small degradation on SEA language capabilities
A key goal was to avoid performance loss in existing language tasks. Encouragingly, evaluations on the SEA-HELM text-only benchmark show that SEA-LION-VL did not suffer significant performance degradation, retaining most of the strengths of its predecessors.

Scores for VQA tasks across languages
In contrast, we observed a performance degradation on Multiple Choice Question (MCQ) style Visual Question Answering (VQA) tasks. The VQA tasks that were tested included MaRVL (ID, TA), CVQA (ID, MS, TA, TL, ZH, JV, SU), and WorldCuisines (EN, ID, TH, TL, ZH, JV, SU). This was expected, as the relatively small and general-domain SFT dataset is insufficient to build robust VQA reasoning skills from the ground up.
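MCQ-style VQA is typically scored as plain accuracy: the model's free-text answer is mapped back to an option letter and compared against the gold option. A minimal scorer along those lines is sketched below; the answer texts, extraction rule, and function names are illustrative assumptions, and real harnesses for benchmarks like CVQA are stricter about parsing.

```python
# Minimal MCQ-VQA accuracy scorer with naive option extraction.
# Real evaluation harnesses parse answers far more carefully.

def extract_option(answer_text, options=("A", "B", "C", "D")):
    """Pull the first standalone option letter out of a free-text answer.
    Naive on purpose: a lone word like 'a' would match option A."""
    for tok in answer_text.replace(".", " ").replace(")", " ").split():
        if tok.upper() in options:
            return tok.upper()
    return None

def mcq_accuracy(answers, gold):
    """Accuracy of extracted option letters against gold option letters."""
    preds = [extract_option(a) for a in answers]
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)

answers = ["The answer is B.", "A", "D) fried rice", "C"]
gold = ["B", "A", "D", "B"]
print(mcq_accuracy(answers, gold))  # → 0.75
```

Because scoring reduces to exact option matching, even small lapses in reasoning or answer formatting directly lower accuracy, which is why general-domain SFT alone leaves these tasks fragile.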
Learnings and The Path Forward
This experimental release has provided several key insights:
- Targeted CPT is effective: Continued pre-training on in-domain, high-quality data can successfully enhance specific capabilities like regional image captioning.
- SFT is the bottleneck: High-quality, diverse, and region-specific instruction tuning data for VQA is the most critical missing piece for improving reasoning.
- Balancing modalities is key: We successfully added visual understanding without significantly compromising the model’s core language skills.
Our work on SEA-LION-VL is just beginning. The path forward involves curating and developing superior SEA-specific datasets for supervised fine-tuning and exploring more advanced techniques to deepen the model’s multimodal reasoning. We are releasing this model to the community to foster collaboration and accelerate progress in building truly multimodal AI for Southeast Asia.
Our Commitment to Responsible AI
Like any model, Gemma-SEA-LION-v4-27B-VL has its limitations. It has not been specifically aligned for safety, and it can be prone to hallucinations. We encourage developers to conduct their own safety fine-tuning and validation to ensure it is suitable for their specific use cases.
The Future is Regional
The release of Gemma-SEA-LION-v4-27B-VL marks a significant milestone in creating more inclusive and representative AI. We believe it will empower developers, researchers, and businesses across Southeast Asia and beyond to build incredible new applications that truly understand the region.
Get Started in Minutes
Integrating Gemma-SEA-LION-v4-27B-VL into your projects is simple. Using the Hugging Face Transformers library, you can get up and running with just a few lines of code:
```python
from transformers import pipeline
import torch

pipe = pipeline(
    "image-text-to-text",
    model="aisingapore/Gemma-SEA-LION-v4-27B-VL",
    device="cuda",
    torch_dtype=torch.bfloat16,
)

messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}]
    },
    {
        "role": "user",
        "content": [
            # Example image of local Southeast Asian candy
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    }
]

output = pipe(text=messages, max_new_tokens=200)
print(output[0]["generated_text"][-1]["content"])
```
For more information, please contact us at sealion@aisingapore.org.
