
Understanding Multimodal LLMs

In a field often drowning in hype, Sebastian Raschka delivers a rare moment of architectural clarity, cutting through the noise to explain exactly how machines are learning to "see" and "speak" simultaneously. While the industry chases the next benchmark score, Raschka focuses on the fundamental engineering choices that determine whether a model merely guesses at an image or truly understands its context. This is not just a tutorial; it is a necessary map for navigating the shift from text-only intelligence to systems that can process the full spectrum of human input.

The Architecture of Vision

Raschka begins by grounding the reader in the practical reality of these systems, defining them not by their marketing but by their function. "Multimodal LLMs are large language models capable of processing multiple types of inputs, where each 'modality' refers to a specific type of data—such as text... sound, images, videos, and more." This definition is crucial because it reframes the technology from a magic trick into a data processing challenge. The author highlights a specific, high-value application that resonates with any professional dealing with unstructured data: "extracting information from a PDF table and converting it into LaTeX or Markdown." This concrete example immediately signals that the piece is about utility, not just theory.


The core of the analysis lies in Raschka's distinction between two dominant engineering strategies. He identifies the first as the "Unified Embedding Decoder Architecture," a method that treats visual data almost exactly like text. "In this approach, images are converted into tokens with the same embedding size as the original text tokens, allowing the LLM to process both text and image input tokens together after concatenation." This is a clever reframing of a complex problem: by forcing images into the same mathematical space as words, the system can use its existing language reasoning capabilities to interpret visuals. The elegance of this approach is its simplicity, but it relies heavily on the quality of the initial conversion.
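
The mechanics are easier to see in code. The sketch below is a minimal PyTorch illustration of the concatenation step Raschka describes; the dimensions, module names, and token counts are assumptions chosen for the example, not values taken from the article.

```python
# Minimal sketch of the unified embedding decoder idea (illustrative, not Raschka's code):
# image features are projected into the same embedding dimension as text tokens,
# then concatenated so an ordinary decoder-only LLM can process both together.
import torch
import torch.nn as nn

embed_dim = 768          # assumed embedding size shared by text and image tokens
vocab_size = 32_000      # assumed text vocabulary size

text_embedding = nn.Embedding(vocab_size, embed_dim)
image_projector = nn.Linear(1024, embed_dim)   # maps vision-encoder outputs (dim 1024 assumed) to embed_dim

text_ids = torch.randint(0, vocab_size, (1, 16))   # 16 text tokens
image_features = torch.randn(1, 64, 1024)          # 64 image "tokens" from a vision encoder

text_tokens = text_embedding(text_ids)             # (1, 16, 768)
image_tokens = image_projector(image_features)     # (1, 64, 768)

# Concatenate along the sequence dimension; the decoder treats both kinds of tokens alike.
llm_input = torch.cat([image_tokens, text_tokens], dim=1)   # (1, 80, 768)
```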

"The purpose of this layer is to project the image patches, which are flattened into a vector, into an embedding size compatible with the transformer encoder."

Raschka explains that this conversion isn't automatic; it requires a "linear projection" layer to act as a translator. He walks the reader through the mechanics of breaking an image into patches, much like breaking a sentence into subwords, and then projecting those patches into a dimension the language model understands. "For a typical text-only LLM that processes text, the text input is usually tokenized... and then passed through an embedding layer." By drawing this parallel, he demystifies the "black box" of computer vision. However, a counterargument worth considering is that this rigid projection might lose subtle visual nuances that don't fit neatly into the language model's pre-existing token vocabulary, potentially limiting the model's ability to grasp abstract visual concepts.
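
A short sketch makes the projection concrete. The patch size, image size, and embedding dimension below are assumed for illustration; this is not Raschka's implementation, only the standard patchify-and-project idiom he is describing.

```python
# Flatten each image patch into a vector, then linearly project it to the embedding size.
import torch
import torch.nn as nn

patch_size, img_size, channels, embed_dim = 16, 224, 3, 768
num_patches = (img_size // patch_size) ** 2        # 14 * 14 = 196 patches

# Each flattened patch has 16 * 16 * 3 = 768 values; the linear layer maps it to embed_dim.
projection = nn.Linear(patch_size * patch_size * channels, embed_dim)

image = torch.randn(1, channels, img_size, img_size)
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
# (1, 3, 14, 14, 16, 16) -> (1, 196, 768): one flattened vector per patch
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, num_patches, -1)
patch_embeddings = projection(patches)             # (1, 196, 768), ready for the transformer
```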

The Cross-Attention Alternative

The second major approach Raschka details is the "Cross-Modality Attention Architecture," which takes a more integrated, albeit complex, path. Instead of flattening the image into a sequence of tokens, this method keeps the image data separate and uses a mechanism called "cross-attention" to let the text and image components interact directly within the model's layers. "In cross-attention, in contrast to self-attention, we have two different input sources... we mix or combine two different input sequences." This distinction is vital for understanding why some models feel more "aware" of an image than others.

Raschka traces this back to the foundational "Attention Is All You Need" paper, noting that while it was originally designed for translation, the logic applies perfectly to vision. "In the context of multimodal LLM, the encoder is an image encoder instead of a text encoder, but the same idea applies." The author argues that this method allows for a more dynamic relationship between the visual and textual data, rather than forcing the image to conform to the text's structure. "The idea is related and goes back to the original transformer architecture... where the two inputs x1 and x2 correspond to the sequence returned by the encoder module... and the input sequence being processed by the decoder part." This architectural choice suggests that the future of multimodal AI may lie in models that can toggle between modes of thinking rather than just merging them.

"In cross-attention, the two input sequences x1 and x2 can have different numbers of elements. However, their embedding dimensions must match."

This flexibility is the strength of the cross-attention model, but it comes with a computational cost that Raschka hints at but doesn't fully dwell on. Critics might note that while this architecture is theoretically superior for complex reasoning, the training data requirements are substantially higher, potentially slowing down the deployment of these advanced models in real-world enterprise settings. The trade-off between architectural elegance and training efficiency remains a critical bottleneck.
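
To make the quoted constraint tangible, here is a rough sketch using PyTorch's built-in multi-head attention module rather than a hand-rolled implementation; the sequence lengths and embedding size are assumed values for illustration.

```python
# Cross-attention sketch: queries come from one sequence (text), keys and values from another (image).
# The two sequences may have different lengths, but their embedding dimensions must match.
import torch
import torch.nn as nn

embed_dim = 768
cross_attn = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)

x1 = torch.randn(1, 20, embed_dim)   # e.g. 20 tokens on the decoder (text) side
x2 = torch.randn(1, 64, embed_dim)   # e.g. 64 tokens from the image encoder

# query=x1, key=value=x2: the text attends over the image representation.
out, attn_weights = cross_attn(query=x1, key=x2, value=x2)
print(out.shape)   # torch.Size([1, 20, 768]) -- output length follows the query sequence
```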

The Training Reality

Finally, Raschka addresses the practicalities of bringing these models to life, emphasizing that we are rarely building from scratch. "Multimodal LLM training typically begins with a pretrained, instruction-finetuned text-only LLM as the base model." This is a significant insight for developers and strategists: the intelligence of these new systems is largely inherited from their text-only ancestors. The image encoder, often a model like CLIP, is frequently "frozen" during training, meaning the system learns to "speak" the language of the frozen vision model rather than relearning how to see.
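
In practice, assembling such a system usually starts by loading those pretrained pieces off the shelf. A hedged sketch using the Hugging Face transformers library might look like the following; the model identifiers are examples, not choices prescribed by the article.

```python
# Load a pretrained, instruction-finetuned text-only LLM and a pretrained vision encoder
# as the starting components of a multimodal model. Identifiers are illustrative examples.
from transformers import AutoModelForCausalLM, CLIPVisionModel

llm = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
vision_encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
```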

"For the image encoder, CLIP is commonly used and often remains unchanged during the entire training process, though there are exceptions."

This reliance on pre-trained components creates a fascinating dependency chain. If the vision encoder is biased or limited, the multimodal model inherits those flaws. Raschka's focus on the "freezing" of components suggests that the rapid progress we are seeing is less about inventing new ways to see and more about better ways to talk about what we already see. This framing is effective because it manages expectations; the leap in capability is an integration challenge, not a fundamental breakthrough in perception.
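
For readers curious what "freezing" looks like in code, the idiom is a simple loop over the encoder's parameters; the modules below are hypothetical stand-ins rather than an actual CLIP encoder.

```python
# Freeze a pretrained image encoder so only the projection layer (and, depending on the
# training recipe, the LLM) receives gradient updates. Stand-in modules and shapes are placeholders.
import torch.nn as nn

def freeze(module: nn.Module) -> None:
    """Turn off gradient updates for every parameter in the module."""
    for param in module.parameters():
        param.requires_grad = False

vision_encoder = nn.Sequential(nn.Linear(768, 1024), nn.GELU())  # placeholder for e.g. a CLIP vision tower
projector = nn.Linear(1024, 768)                                 # stays trainable

freeze(vision_encoder)   # the "frozen" encoder keeps its pretrained behavior throughout training

print(any(p.requires_grad for p in vision_encoder.parameters()))  # False
print(all(p.requires_grad for p in projector.parameters()))       # True
```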

Bottom Line

Sebastian Raschka's piece succeeds by stripping away the mystique of multimodal AI to reveal the engineering scaffolding underneath. The strongest part of his argument is the clear dichotomy between the unified token approach and the cross-attention method, providing a mental model for readers to evaluate new releases. However, the piece's biggest vulnerability is its light touch on the ethical implications of these architectures; as models become better at interpreting images, the potential for surveillance and bias amplification grows, a risk that warrants deeper scrutiny. Readers should watch for how the industry balances the efficiency of frozen encoders against the need for more adaptable, less biased vision systems in the coming year.

Sources

Understanding Multimodal LLMs

by Sebastian Raschka · Ahead of AI

It was a wild two months. There have once again been many developments in AI research, with two Nobel Prizes awarded to AI and several interesting research papers published. 

Among others, Meta AI released their latest Llama 3.2 models, which include open-weight versions for the 1B and 3B large language models and two multimodal models.

In this article, I aim to explain how multimodal LLMs function. Additionally, I will review and summarize roughly a dozen other recent multimodal papers and models published in recent weeks (including Llama 3.2) to compare their approaches.


But before we begin, I also have some exciting news to share on the personal front! My book, "Build A Large Language Model (From Scratch)", is now finally available on Amazon!

Writing this book was a tremendous effort, and I’m incredibly grateful for all the support and motivating feedback over the past two years—especially in these last couple of months, as so many kind readers have shared their feedback. Thank you all, and as an author, there is nothing more motivating than to hear that the book makes a difference in your careers!

For those who have finished the book and are eager for more, stay tuned! I’ll be adding some bonus content to the GitHub repository in the coming months. 

P.S. If you have read the book, I'd really appreciate it if you could leave a brief review; it truly helps us authors!

1. Use cases of multimodal LLMs.

What are multimodal LLMs? As hinted at in the introduction, multimodal LLMs are large language models capable of processing multiple types of inputs, where each "modality" refers to a specific type of data—such as text (like in traditional LLMs), sound, images, videos, and more. For simplicity, we will primarily focus on the image modality alongside text inputs.

A classic and intuitive application of multimodal LLMs is image captioning: you provide an input image, and the model generates a description of the image, as shown in the figure below.

Of course, there are many other use cases. For example, one of my favorites is extracting information from a PDF table and converting it into LaTeX or Markdown.

2. Common approaches to building multimodal LLMs.

There are two main approaches to building multimodal LLMs:

Method A: Unified Embedding Decoder Architecture approach;

Method B: Cross-Modality Attention Architecture approach.