A new model from Google has quietly upended the entire AI landscape. Embedding 2 is the first natively multimodal embedding model from Google, and it's doing something that most people aren't even close to understanding: letting you embed videos directly into a vector database.
This isn't a small tweak. For years, anyone wanting to analyze video data through RAG systems had to use hacky workarounds—transcribing videos, describing them with text, then pretending they were normal documents. Now the video itself becomes searchable. The business implications are massive. Organizations with proprietary video archives suddenly have a path to actually interrogate that content at scale.
But here's what's causing the confusion: most people assume embedding a video means you can simply ask questions about it and get intelligent answers. That's not how it works. And that's why this article matters.
The Naive RAG System Problem
The standard expectation looks something like this: you find a great video, embed it using Embedding 2, then ask your AI system specific questions about the content. You imagine the LLM will analyze what's inside the video and give you a detailed response.
That's not what happens.
When you take a naive RAG system—one that simply connects Embedding 2 to an existing pipeline—and feed it a video about Playwright browser automation, then ask "How does Playwright test without opening browsers?", the system doesn't respond with a text explanation. It hands you back a two-minute clip of the original video and says, "The answer is in there." That's what most tutorials show people how to build. And while there's some value in that, it's not where the real utility lives.
This is the critical misunderstanding: an embedding model takes data and turns it into a vector—a point in mathematical space. It handles similarity and search. But it's not responsible for generating explanations or analysis. It's like the difference between recognizing a face in a crowd and being able to describe who that person is, what they do, and why they matter.
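That distinction—similarity versus explanation—is easy to see in code. Here's a minimal sketch using toy three-dimensional vectors (real embeddings have hundreds or thousands of dimensions, and the numbers here are made up for illustration):

```python
import math

def cosine_similarity(a, b):
    # Similarity is just geometry: how closely two vectors point the same way.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" — placeholders, not real model output.
ship = [0.9, 0.1, 0.0]
boat = [0.8, 0.2, 0.1]
banana = [0.0, 0.1, 0.95]

# The model can tell you ship and boat are neighbors...
print(cosine_similarity(ship, boat) > cosine_similarity(ship, banana))  # True
# ...but nothing in these numbers explains *what* a ship is.
```

Everything an embedding model gives you is in those numbers: nearness, nothing more.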
How RAG Actually Works
To understand where this goes wrong, you need to understand how retrieval-augmented generation flows.
On one side of the equation: document ingestion. A text document about World War II battleships goes through an embedding model, which turns it into 1,536 numbers—a vector in space. The model places that document near other vectors with similar semantic meaning (ships, boats) rather than unrelated concepts (bananas, green apples).
On the other side: retrieval. A user's question becomes its own vector. The system searches the database for the vectors closest to the question's vector, retrieves those documents, and feeds their content to the LLM to augment its answer.
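Both sides can be sketched in a few lines. The `embed()` below is a hypothetical stand-in for a real embedding model—it just counts hits against a tiny fixed vocabulary so that related documents land near each other—but the shape of the pipeline is the same:

```python
import math

# Hypothetical stand-in for a real embedding model: count hits against a
# tiny fixed vocabulary so related documents land near each other in space.
VOCAB = ["battleship", "ship", "naval", "war", "banana", "apple", "fruit", "green"]

def embed(text):
    words = text.lower().split()
    return [sum(w.startswith(v) for w in words) for v in VOCAB]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Ingestion: each vector is stored PAIRED with the original text.
docs = [
    "World War II battleships carried massive naval guns.",
    "Bananas and green apples are popular fruits.",
]
index = [(embed(d), d) for d in docs]

# Retrieval: the question becomes its own vector; the closest stored
# vector hands back the text it was paired with—text an LLM can read.
question = "How were World War II battleships armed?"
q = embed(question)
best_vec, best_text = max(index, key=lambda pair: cosine(q, pair[0]))
print(best_text)
```

The crucial detail is the pairing in the index: retrieval succeeds not because the vector is meaningful to the LLM, but because it's stored alongside text that is.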
This works perfectly with text because when a document is retrieved, it's paired with actual text—the original PDF or Word file—that the system can read and respond to.
Now apply that same logic to video. You embed a video about World War II battleships. It becomes a vector in space. A user asks a question. The system retrieves that video vector. And what comes back? An MP4 file. Not analysis. Not explanation. Just the raw video data.
The LLM can't ingest what's inside the video because it's not text—it's just a video file. Unless you're using something like Gemini specifically designed to parse video, you have no real answer mechanism.
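Swap the text index for a video index and the dead end becomes obvious. In this naive sketch (the vectors are made-up placeholders, not real model output), the best the system can ever do is hand back a filename:

```python
# Naive video RAG: the vector index maps embeddings to raw media files.
video_index = {
    "playwright_tutorial.mp4": [0.7, 0.2, 0.1],  # placeholder vectors
    "pasta_recipe.mp4": [0.1, 0.1, 0.9],
}

def retrieve(question_vec):
    # Nearest neighbor by dot product — a toy stand-in for a vector DB query.
    return max(video_index,
               key=lambda f: sum(q * v for q, v in zip(question_vec, video_index[f])))

answer = retrieve([0.8, 0.1, 0.1])
print(answer)  # a file path — not an explanation the LLM can work with
```

Retrieval worked perfectly. The system found the right video. It just has nothing readable to say about it.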
The Actual Solution
This is solvable. But it requires rethinking where the work happens.
You don't want to augment on the retrieval side—where every question triggers expensive video analysis by the LLM. That's slow and costly. Instead, you want to augment during ingestion: when you first put that video into the system, you also run it through Gemini (or a comparable model) to generate a text description and transcript.
The pipeline becomes: video goes into Embedding 2, but simultaneously gets processed by a language model to produce accompanying text. When a question retrieves that video vector, it's paired not just with an MP4 file, but with written explanation—text the AI can actually ingest and use to generate answers.
This is what makes a multimodal RAG system actually work. The architecture requires adding Gemini (or similar) on the ingestion side of the pipeline, creating a workflow where video content gets augmented once, at entry, rather than on every query.
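A minimal sketch of that ingestion-side pipeline, with hypothetical stand-ins: `describe_video()` represents the one-time call to Gemini (or a comparable model), and `embed_video()` represents Embedding 2's video endpoint—neither reflects a real API signature:

```python
def describe_video(path):
    # Hypothetical stand-in for a video-capable LLM call, made ONCE at
    # ingestion time, returning a transcript and description as text.
    return f"Transcript and description generated for {path}"

def embed_video(path):
    # Hypothetical stand-in for the embedding model's video endpoint.
    return [0.5, 0.5, 0.0]  # placeholder vector

index = []

def ingest(path):
    # The video is embedded AND described in the same pass, so the vector
    # is stored alongside text the LLM can actually read at query time.
    index.append({
        "vector": embed_video(path),
        "media": path,
        "text": describe_video(path),
    })

ingest("battleships_documentary.mp4")
print(index[0]["text"])  # readable text now travels with the MP4
```

Because the description is generated once per video rather than once per question, the expensive video-analysis cost is paid at entry, and every subsequent query is as cheap as ordinary text RAG.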
Counterpoints Worth Considering
Critics might note that this approach still has limitations. Embedding 2 only handles videos up to 120 minutes and text up to 8,192 tokens. While workarounds exist for longer content, the model wasn't designed with heavy video workloads in mind—it's an embedding model first, not an analysis engine.
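One common workaround—an assumption about practice, not a documented feature of the model—is to split long media into segments under the limit and embed each segment separately:

```python
MAX_MINUTES = 120  # per-video limit described for Embedding 2

def segment_bounds(total_minutes, max_len=MAX_MINUTES):
    # Cover the whole video with (start, end) windows, each within the limit.
    return [(start, min(start + max_len, total_minutes))
            for start in range(0, total_minutes, max_len)]

print(segment_bounds(300))  # [(0, 120), (120, 240), (240, 300)]
```

Each segment then gets its own vector (and, per the architecture above, its own generated description), at the cost of more index entries per video.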
Additionally, some argue that using Gemini during ingestion adds latency and cost to the pipeline. Others suggest that pure multimodal models like Gemini itself might eventually render this intermediate step unnecessary by handling both embedding and explanation natively. The architecture described here may be a transitional solution.
Embedding models take data and turn it into vectors—they handle similarity and search, not explanation or analysis.
Bottom Line
Chase H's core argument is compelling: the ability to embed video directly into vector databases represents a genuine breakthrough in RAG architecture. The biggest misunderstanding isn't whether embedding works—it's that most people assume embedding automatically enables analysis. It doesn't. You need an augmentation pipeline during ingestion, not retrieval. The strongest part of this piece is the clear explanation of why naive implementations fail and what the actual solution looks like. Its vulnerability is that this architecture may become obsolete once multimodal models like Gemini advance further—the intermediate step might eventually disappear entirely.