In a landscape where tech giants often chase the largest possible models, Grab's engineering team made a counterintuitive move: they built a smaller, custom vision model from scratch to solve a problem that massive general-purpose AI couldn't crack. Alex Xu's breakdown of this journey reveals that for complex, region-specific tasks like document verification in Southeast Asia, specialization beats scale. This isn't just a case study in machine learning; it's a masterclass in why one-size-fits-all AI often fails when it meets the messy reality of global diversity.
The Architecture of Understanding
Xu begins by dismantling the assumption that a standard text-based Large Language Model (LLM) can simply "look" at a document. He explains that a Vision LLM requires a distinct three-part architecture: an image encoder to translate pixels into numbers, a projector to bridge the gap between vision and language, and the language model itself to generate the output. "The first component is the image encoder... Think of it as translating visual information into a structured representation of numbers and vectors," Xu writes. This distinction is crucial because it highlights that seeing and reading are fundamentally different cognitive tasks for a machine.
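To make that three-part pipeline concrete, here is a minimal PyTorch sketch of the flow Xu describes; the module names, sizes, and the simplified transformer stack are purely illustrative, not Grab's actual architecture.

```python
# Minimal sketch of the three-part Vision LLM pipeline: encoder -> projector -> language model.
# All components below are toy stand-ins, not the real Qwen2-VL modules.
import torch
import torch.nn as nn

class TinyVisionLLM(nn.Module):
    def __init__(self, patch_dim=768, vision_dim=512, lm_dim=1024, vocab_size=32000):
        super().__init__()
        # 1. Image encoder: turns flattened pixel patches into visual feature vectors.
        self.image_encoder = nn.Sequential(
            nn.Linear(patch_dim, vision_dim),
            nn.GELU(),
            nn.Linear(vision_dim, vision_dim),
        )
        # 2. Projector: bridges vision and language by mapping visual features
        #    into the language model's embedding space.
        self.projector = nn.Linear(vision_dim, lm_dim)
        # 3. Language model: reasons over the projected visual tokens and produces
        #    token logits. (A production model would be an autoregressive decoder.)
        self.language_model = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=lm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(lm_dim, vocab_size)

    def forward(self, image_patches):
        visual_features = self.image_encoder(image_patches)  # pixels -> numbers/vectors
        visual_tokens = self.projector(visual_features)      # vision space -> language space
        hidden = self.language_model(visual_tokens)
        return self.lm_head(hidden)                          # per-token vocabulary logits

# A "document" of 196 patches, each flattened to 768 values.
logits = TinyVisionLLM()(torch.randn(1, 196, 768))
print(logits.shape)  # torch.Size([1, 196, 32000])
```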
The core of the argument rests on the failure of existing tools. Traditional Optical Character Recognition (OCR) systems, which have been the industry standard since the 1970s, crumbled under the sheer variety of Southeast Asian document templates. Even powerful proprietary models struggled, often producing "hallucinations"—confidently incorrect outputs—when faced with local languages. Xu notes that while open-source models offered efficiency, they "lacked the accuracy required for production deployment." This gap between theoretical capability and practical utility is where the engineering team had to intervene.
"Preserving the original resolution maintains text integrity and improves accuracy."
The decision to select the Qwen2-VL 2B model as a starting point was driven by a specific technical necessity: dynamic resolution. Unlike models that force images into fixed sizes, distorting text in the process, this model could handle images in their native resolution. This choice was not merely about convenience; it was a prerequisite for accuracy. As Xu puts it, "resizing or cropping images can distort text characters, leading to recognition errors." For busy engineers, this is a vital reminder that data preprocessing is often more critical than model selection.
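The preprocessing difference is easy to see in code. The sketch below is a simplified illustration of the idea; the patch size and helper functions are assumptions, and Qwen2-VL's real dynamic-resolution logic lives in its Hugging Face processor.

```python
# Why dynamic resolution matters: the fixed-size path squashes a tall document into a
# square, while the dynamic path only snaps dimensions to the nearest patch multiple,
# preserving aspect ratio and character shapes. Patch size here is an assumption.
from PIL import Image

PATCH = 28  # assumed vision-patch size

def fixed_size_resize(img: Image.Image, side: int = 224) -> Image.Image:
    # Classic approach: force every image to side x side, distorting text characters.
    return img.resize((side, side))

def dynamic_resize(img: Image.Image, patch: int = PATCH) -> Image.Image:
    # Keep native resolution; just round each dimension to a patch multiple
    # so the encoder can tile the image cleanly.
    w, h = img.size
    new_w = max(patch, round(w / patch) * patch)
    new_h = max(patch, round(h / patch) * patch)
    return img.resize((new_w, new_h))

doc = Image.new("RGB", (1240, 1754), "white")  # stand-in for an A4 document scan
print(fixed_size_resize(doc).size)  # (224, 224)   -> aspect ratio destroyed
print(dynamic_resize(doc).size)     # (1232, 1764) -> text proportions preserved
```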
The Data Dilemma
The most compelling part of Xu's narrative is the realization that the model's intelligence was limited not by its brain, but by its eyes. The team discovered that while the language decoder understood Thai or Vietnamese text, the vision encoder had never learned to recognize what those characters looked like in an image. "The language model might understand Thai text, but the vision encoder had never learned to recognize what Thai characters look like in images," Xu observes. This insight drove a pivot from simple fine-tuning to a more rigorous training regimen.
To solve this, Grab didn't just scrape the web; they engineered their own reality. They created a synthetic dataset by rendering text from Common Crawl in various fonts and backgrounds, effectively teaching the model to "see" before it could "read." They also built Documint, an internal framework to auto-label real documents. This dual approach allowed them to generate unlimited variations of training data, a strategy that echoes the early days of computer vision where synthetic data was often the only way to get models to recognize rare objects.
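Conceptually, the synthetic pipeline boils down to pairing rendered images with the ground-truth text used to draw them. The PIL sketch below illustrates that idea; the corpus strings, fonts, and backgrounds are placeholders, not the actual Documint or Common Crawl pipeline.

```python
# Toy illustration of synthetic training data: render real text strings onto varied
# backgrounds so the vision encoder learns what the characters look like in images.
import random
from PIL import Image, ImageDraw, ImageFont

corpus = ["ใบขับขี่", "Giấy phép lái xe", "123 Main Street"]  # stand-in for Common Crawl text

def render_sample(text):
    # Random light background to mimic varied document papers and scan conditions.
    bg = tuple(random.randint(200, 255) for _ in range(3))
    img = Image.new("RGB", (400, 80), bg)
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()  # in practice, swap in real Thai/Vietnamese .ttf fonts
    draw.text((random.randint(5, 40), random.randint(5, 30)), text, fill=(0, 0, 0), font=font)
    return img, text  # (image, ground-truth label) pair for training

pairs = [render_sample(random.choice(corpus)) for _ in range(1000)]
```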
Critics might argue that relying heavily on synthetic data introduces a "sim-to-real" gap, where the model performs well on artificial images but fails on real-world scans. However, Xu counters this by showing how the team used human reviewers to refine the auto-labeled data, ensuring high accuracy before the model ever saw it. The result was a model that could handle the chaotic reality of a crumpled, tilted, or poorly lit driver's license.
From 2 Billion to 1 Billion Parameters
The final phase of the project is where the story becomes truly distinctive. Instead of settling for the improved 2-billion-parameter model, the team decided to build a 1-billion-parameter model from scratch. They combined the best vision encoder from the larger model with a compact language decoder. "A smaller model of approximately 1 billion parameters, built from scratch and trained comprehensively, can achieve near state-of-the-art results," Xu concludes. This is a bold claim in an era obsessed with parameter counts, suggesting that efficiency is the new frontier.
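In code terms, this is model surgery: keep the adapted vision encoder, bolt on a smaller decoder, and retrain. A toy sketch of that composition, with stand-in components rather than the actual Qwen2-VL modules:

```python
import torch.nn as nn

# Reused component: stand-in for the 2B model's vision encoder, already adapted
# to Southeast Asian documents during the earlier fine-tuning stage.
tuned_vision_encoder = nn.Sequential(nn.Linear(768, 1536), nn.GELU(), nn.Linear(1536, 1536))

# New components: a narrower projector and a compact decoder, trained from scratch.
compact_projector = nn.Linear(1536, 1024)
compact_decoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=1024, nhead=8, batch_first=True), num_layers=2)

# The custom ~1B-parameter model is the reused encoder stitched to the smaller decoder.
custom_small_model = nn.Sequential(tuned_vision_encoder, compact_projector, compact_decoder)
```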
The performance gains were staggering. The custom 1B model was 48% faster at median latency and 56% faster in worst-case scenarios compared to the larger model. "Grab identified that one of the biggest weaknesses of external APIs like ChatGPT or Gemini was the P99 latency, which can easily be 3 to 4 times higher than the P50 latency," Xu writes. For a service like electronic know-your-customer (eKYC) verification, where users are waiting for approval, that consistency is the difference between a seamless experience and a frustrated customer.
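Percentile arithmetic makes that point concrete. The toy simulation below uses an illustrative lognormal latency distribution, not Grab's measured numbers, to show how a P99 can sit three to four times above a perfectly healthy-looking P50.

```python
# Why tail latency matters: the median (P50) looks fine, but the P99 a user actually
# hits during eKYC verification can be several times worse. Numbers are illustrative.
import random
import statistics

random.seed(7)
# Simulated per-request latencies (seconds): mostly fast, occasionally very slow.
latencies = [random.lognormvariate(mu=0.0, sigma=0.6) for _ in range(10_000)]

qs = statistics.quantiles(latencies, n=100)  # 99 cut points
p50, p99 = qs[49], qs[98]
print(f"P50 = {p50:.2f}s, P99 = {p99:.2f}s, ratio = {p99 / p50:.1f}x")
```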
"Full parameter fine-tuning proved superior to LoRA for specialized, non-Latin script domains."
This finding challenges the prevailing trend of using Low-Rank Adaptation (LoRA) for all fine-tuning tasks. While LoRA is resource-efficient, Xu demonstrates that for domains with significant visual differences—like the unique scripts of Southeast Asia—updating all model parameters is necessary to capture the nuance. This is a critical lesson for any organization trying to deploy AI in non-Western contexts.
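The practical difference between the two regimes is simply which parameters receive gradient updates. The sketch below is a toy contrast of the two; the hand-rolled LoRALinear wrapper and module sizes are assumptions, not the training code Xu describes.

```python
# With LoRA, base weights stay frozen and only small low-rank adapters train; with full
# fine-tuning, every parameter (including the vision encoder that has never seen Thai
# glyphs) gets updated. Module sizes are illustrative.
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base Linear plus a trainable low-rank update: W x + B(A x)."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                   # base weights frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)   # A: project down
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)  # B: project up
        nn.init.zeros_(self.lora_b.weight)                            # start as a no-op update

    def forward(self, x):
        return self.base(x) + self.lora_b(self.lora_a(x))

model = nn.Sequential(nn.Linear(512, 512), nn.GELU(), nn.Linear(512, 512))

# Full fine-tuning: every weight trainable, so visual features for new scripts can shift.
full_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

# LoRA-style tuning: wrap the linear layers, train only the adapters.
lora_model = nn.Sequential(LoRALinear(nn.Linear(512, 512)), nn.GELU(),
                           LoRALinear(nn.Linear(512, 512)))
lora_params = sum(p.numel() for p in lora_model.parameters() if p.requires_grad)
print(full_params, lora_params)  # full fine-tuning updates ~32x more parameters in this toy case
```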
Bottom Line
Xu's analysis succeeds because it moves beyond the hype of "bigger is better" to demonstrate that context is king. The strongest part of this argument is the empirical proof that a custom, smaller model can outperform massive generalists when trained on high-quality, region-specific data. The biggest vulnerability, however, is the immense resource investment required to build such a system from scratch; not every company has the engineering bandwidth to create a synthetic data pipeline and an internal labeling framework. For the smart, busy reader, the takeaway is clear: in the race for AI adoption, the winner won't be the one with the biggest model, but the one with the most relevant data.