How Grab Built a Vision LLM to Scan Images

In a landscape where tech giants often chase the largest possible models, Grab's engineering team made a counterintuitive move: they built a smaller, custom vision model from scratch to solve a problem that massive general-purpose AI couldn't crack. Alex Xu's breakdown of this journey reveals that for complex, region-specific tasks like document verification in Southeast Asia, specialization beats scale every time. This isn't just a case study in machine learning; it's a masterclass in why one-size-fits-all AI often fails when it meets the messy reality of global diversity.

The Architecture of Understanding

Xu begins by dismantling the assumption that a standard text-based Large Language Model (LLM) can simply "look" at a document. He explains that a Vision LLM requires a distinct three-part architecture: an image encoder to translate pixels into numbers, a projector to bridge the gap between vision and language, and the language model itself to generate the output. "The first component is the image encoder... Think of it as translating visual information into a structured representation of numbers and vectors," Xu writes. This distinction is crucial because it highlights that seeing and reading are fundamentally different cognitive tasks for a machine.
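To make the three-part pipeline concrete, here is a minimal PyTorch sketch of how an image encoder, a vision-language projector, and a language model fit together. All module names, dimensions, and the toy transformer are illustrative assumptions, not Grab's actual architecture.

```python
import torch
import torch.nn as nn

class TinyVisionLLM(nn.Module):
    """Illustrative wiring of the three components; not Grab's model."""
    def __init__(self, patch_dim=768, lm_dim=1024, vocab=32000):
        super().__init__()
        # 1) Image encoder: pixels -> sequence of visual embeddings.
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, patch_dim, kernel_size=16, stride=16),  # patchify
            nn.Flatten(2),                                       # (B, D, N)
        )
        # 2) Vision-language projector: map visual embeddings into LM space.
        self.projector = nn.Linear(patch_dim, lm_dim)
        # 3) Language model (stand-in): consumes projected image tokens + text.
        self.lm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=lm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(lm_dim, vocab)

    def forward(self, image, text_embeds):
        vis = self.image_encoder(image).transpose(1, 2)  # (B, N, patch_dim)
        vis = self.projector(vis)                        # (B, N, lm_dim)
        seq = torch.cat([vis, text_embeds], dim=1)       # prepend image tokens
        return self.lm_head(self.lm(seq))

model = TinyVisionLLM()
img = torch.randn(1, 3, 224, 224)          # one RGB image
txt = torch.randn(1, 5, 1024)              # five already-embedded text tokens
logits = model(img, txt)
print(tuple(logits.shape))                 # (1, 196 + 5, 32000)
```

The key design point is the projector: the encoder and the language model are trained in different representation spaces, and the projector is the learned translation layer between them.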

The core of the argument rests on the failure of existing tools. Traditional Optical Character Recognition (OCR) systems, which have been the industry standard since the 1970s, crumbled under the sheer variety of Southeast Asian document templates. Even powerful proprietary models struggled, often producing "hallucinations"—confidently incorrect outputs—when faced with local languages. Xu notes that while open-source models offered efficiency, they "lacked the accuracy required for production deployment." This gap between theoretical capability and practical utility is where the engineering team had to intervene.

"Preserving the original resolution maintains text integrity and improves accuracy."

The decision to select the Qwen2-VL 2B model as a starting point was driven by a specific technical necessity: dynamic resolution. Unlike models that force images into fixed sizes, distorting text in the process, this model could handle images in their native resolution. This choice was not merely about convenience; it was a prerequisite for accuracy. As Xu puts it, "resizing or cropping images can distort text characters, leading to recognition errors." For busy engineers, this is a vital reminder that data preprocessing is often more critical than model selection.
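A small back-of-the-envelope sketch shows why fixed-size resizing hurts and what dynamic resolution buys. The 14-pixel patch size is an assumption chosen to match common ViT-style encoders; the functions are illustrative, not Qwen2-VL's actual preprocessing.

```python
# Fixed-size resize distorts text; dynamic resolution keeps the native
# aspect ratio and simply varies the number of vision patches (tokens).
PATCH = 14  # assumed ViT-style patch size

def fixed_resize_scale(w, h, target=224):
    """Naive approach: independent x/y scaling warps characters."""
    return target / w, target / h  # unequal factors -> squashed glyphs

def dynamic_patches(w, h, patch=PATCH):
    """Dynamic resolution: snap dimensions to the patch grid, no warping."""
    grid_w, grid_h = round(w / patch), round(h / patch)
    return grid_w * grid_h  # token count grows with image size

# A 1280x720 photo of a driver's license:
sx, sy = fixed_resize_scale(1280, 720)
print(f"fixed resize scales: x={sx:.3f}, y={sy:.3f}")  # x != y -> distortion
print("dynamic patch tokens:", dynamic_patches(1280, 720))
```

The trade-off is explicit: dynamic resolution preserves every glyph's shape, but a larger image produces more vision tokens and therefore more compute per request.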

The Data Dilemma

The most compelling part of Xu's narrative is the realization that the model's intelligence was limited not by its brain, but by its eyes. The team discovered that while the language decoder understood Thai or Vietnamese text, the vision encoder had never learned to recognize what those characters looked like in an image. "The language model might understand Thai text, but the vision encoder had never learned to recognize what Thai characters look like in images," Xu observes. This insight drove a pivot from simple fine-tuning to a more rigorous training regimen.

To solve this, Grab didn't just scrape the web; they engineered their own reality. They created a synthetic dataset by rendering text from Common Crawl in various fonts and backgrounds, effectively teaching the model to "see" before it could "read." They also built Documint, an internal framework to auto-label real documents. This dual approach allowed them to generate unlimited variations of training data, a strategy that echoes the early days of computer vision where synthetic data was often the only way to get models to recognize rare objects.

Critics might argue that relying heavily on synthetic data introduces a "sim-to-real" gap, where the model performs well on artificial images but fails on real-world scans. However, Xu counters this by showing how the team used human reviewers to refine the auto-labeled data, ensuring high accuracy before the model ever saw it. The result was a model that could handle the chaotic reality of a crumpled, tilted, or poorly lit driver's license.

From 2 Billion to 1 Billion Parameters

The final phase of the project is where the story becomes truly distinctive. Instead of settling for the improved 2-billion-parameter model, the team decided to build a 1-billion-parameter model from scratch. They combined the best vision encoder from the larger model with a compact language decoder. "A smaller model of approximately 1 billion parameters, built from scratch and trained comprehensively, can achieve near state-of-the-art results," Xu concludes. This is a bold claim in an era obsessed with parameter counts, suggesting that efficiency is the new frontier.

The performance gains were staggering. The custom 1B model was 48% faster at median latency and 56% faster in worst-case scenarios compared to the larger model. "Grab identified that one of the biggest weaknesses of external APIs like ChatGPT or Gemini was the P99 latency, which can easily be 3 to 4 times higher than the P50 latency," Xu writes. For a service like electronic know-your-customer (eKYC) verification, where users are waiting for approval, that consistency is the difference between a seamless experience and a frustrated customer.
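The P50-versus-P99 point is easy to see with a simulation: a service where most requests are fast but a small fraction are very slow has a modest median and a painful tail. The latency distribution below is invented purely for illustration.

```python
import random
import statistics

random.seed(7)
# Simulated per-request latencies (ms) with a heavy tail, as external
# APIs often exhibit: 95% of requests are fast, 5% are very slow.
latencies = ([random.gauss(400, 50) for _ in range(950)]
             + [random.gauss(1600, 300) for _ in range(50)])

qs = statistics.quantiles(latencies, n=100)  # 99 percentile cut points
p50, p99 = qs[49], qs[98]
print(f"P50 = {p50:.0f} ms, P99 = {p99:.0f} ms, ratio = {p99 / p50:.1f}x")
```

The median looks healthy, yet one user in a hundred waits several times longer, and for a blocking flow like eKYC those tail requests are the ones customers remember.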

"Full parameter fine-tuning proved superior to LoRA for specialized, non-Latin script domains."

This finding challenges the prevailing trend of using Low-Rank Adaptation (LoRA) for all fine-tuning tasks. While LoRA is resource-efficient, Xu demonstrates that for domains with significant visual differences—like the unique scripts of Southeast Asia—updating all model parameters is necessary to capture the nuance. This is a critical lesson for any organization trying to deploy AI in non-Western contexts.
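The structural difference between the two approaches is easy to show. Below is a minimal hand-rolled LoRA adapter (not the PEFT library, and not Grab's training code): the base weight is frozen and only a low-rank update trains, whereas full fine-tuning leaves every parameter trainable. Rank and alpha are illustrative defaults.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA adapter: frozen base weight + trainable low-rank delta."""
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # base layer stays frozen under LoRA
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # delta starts at 0
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = nn.Linear(1024, 1024)
lora = LoRALinear(layer)

trainable = sum(p.numel() for p in lora.parameters() if p.requires_grad)
total = sum(p.numel() for p in lora.parameters())
print(f"LoRA trains {trainable:,} of {total:,} params")

# Full fine-tuning, by contrast, simply unfreezes everything:
for p in layer.parameters():
    p.requires_grad_(True)
```

With rank 8, LoRA trains under 2% of this layer's parameters, which explains both its efficiency and, per Xu's finding, its limits: a low-rank delta may be too constrained to relearn visually unfamiliar scripts.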

Bottom Line

Xu's analysis succeeds because it moves beyond the hype of "bigger is better" to demonstrate that context is king. The strongest part of this argument is the empirical proof that a custom, smaller model can outperform massive generalists when trained on high-quality, region-specific data. The biggest vulnerability, however, is the immense resource investment required to build such a system from scratch; not every company has the engineering bandwidth to create a synthetic data pipeline and an internal labeling framework. For the smart, busy reader, the takeaway is clear: in the race for AI adoption, the winner won't be the one with the biggest model, but the one with the most relevant data.

Sources

How Grab Built a Vision LLM to Scan Images

Digital services require accurate extraction of information from user-submitted documents such as identification cards, driver’s licenses, and vehicle registration certificates. This process is essential for electronic know-your-customer (eKYC) verification. However, the diversity of languages and document formats across the region makes this task particularly challenging.

The Grab Engineering Team faced significant obstacles with traditional Optical Character Recognition (OCR) systems, which struggled to handle the variety of document templates. While powerful proprietary Large Language Models (LLMs) were available, they often failed to adequately understand Southeast Asian languages, produced errors and hallucinations, and suffered from high latency. Open-source Vision LLMs offered better efficiency but lacked the accuracy required for production deployment.

This situation prompted Grab to fine-tune existing models and eventually build a lightweight, specialized Vision LLM from the ground up. In this article, we will look at the complete architecture, the technical decisions made, and the results achieved.

Disclaimer: This post is based on publicly shared details from the Grab Engineering Team. Please comment if you notice any inaccuracies.

Understanding Vision LLMs

Before diving into the solution, it helps to understand what a Vision LLM is and how it differs from traditional text-based language models.

A standard LLM processes text inputs and generates text outputs. A Vision LLM extends this capability by enabling the model to understand and process images. The architecture consists of three essential components working together:

The first component is the image encoder. This module processes an image and converts it into a numerical format that computers can work with. Think of it as translating visual information into a structured representation of numbers and vectors.

The second component is the vision-language projector. This acts as a bridge between the image encoder and the language model. It transforms the numerical representation of the image into a format that the language model can interpret and use alongside text inputs.

The third component is the language model itself. This is the familiar text-processing model that takes both the transformed image information and any text instructions to generate a final text output. In the case of document processing, this output would be ...